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Abstract 


The  direct  solution  of  large,  sparse  unsymmetric  sets  of 
simultaneous  equations  is  commonly  involved  in  the  numerical 
solution  of  algebraic,  differential,  and  partial  differential 
equations.  This  report  describes  two  new  classes  of  computa- 
tional algorithms  for  the  solution  of  such  equations.  Each 
algorithm  detects  matrix  structure  suitable  for  vector  processing 
and,  potentially,  for  faster  processing  on  cache  machines.  One  nro- 
cedure  favors  structure  usually  associated  with  small  sparse 
matrices;  one  is  directed  toward  sets  of  equations  requiring  a 
large  backing  store.  Comparisons  of  timing  (on  a cache  machine) 
and  of  memory  requirements  are  made  between  these  new  procedures 
and  existing  general  sparsity  techniques  for  a variety  of 
science- engineering  examples.  Issues  related  to  implementation 
are  discussed.  Finally,  flowcharts  and  other  user  information 
are  given  for  software  implementations  of  the  two  algorithms. 
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Preface 

This  report  considers  a number  of  issues  related  to  the 
vectorization  and  partitioning  of  general  sparse  matrix 
solution  algorithms  that  have  not  previously  appeared  in  the 
sparse  matrix  literature.  Typically,  these  involve  processor 
modeling  and  algorithm  analysis,  design,  and  evaluation, 
particularly  as  they  relate  to  the  implementation  and  use  of  a 
software  package  prepared  by  the  authors.  With  this 
rather  broad  topical  coverage  - perhaps  suitable  for  several 
reports  - it  is  felt  to  be  useful  to  present  the  major  results 
with  page  references,  for  the  guidance  of  readers  with  specific 
research  or  software  interests. 

Vectorization 

1.  The  sparse  solution  is  decomposed  into  symbolic  and 
numeric  phases;  in  contrast  to  previously  proposed  scalar 
algorithms,  this  vectorized  version  results  in  relatively 
less  symbolic  processing  time  for  large  matrices  (page  21). 

2.  The  average  length  (Laye)  of  vectors  processed  in 
the  inner  loop  of  the  numeric  phase,  together  with  the  nature 

of  the  processor  (scalar  or  vector),  determine  the  relative  effi- 
ciency of  the  vector  algorithm.  This  vector  algorithm  may  be 
preferred  even  on  scalar  processors  if  L is  sufficiently 

3.  V0 

large  (pages  21-25). 

3.  Timing  comparisons  with  a variety  of  other  sparse 
solvers  are  given  for  a family  of  finite  element  problems 
(Appendix  A) . 


(i) 


1.  I/O  transfer  tine  to  a backing  store  is  an  important 
issue,  especially  for  the  forward  and  back  substitution  steps 
however,  careful  accounting  in  the  symbolic  processing  phase 
can  be  used  to  predict  when  computation  is  I/O  bound  or  when 
the  cost  of  local  store  in  a virtual  system  becomes  pro- 
hibitive (pages  33-38). 

2.  The  megaflop  rate,  popular  in  evaluating  vector  pro- 
cessors, can  be  used  to  succinctly  display  the  efficiency  of 
partitioned  general  sparsity  algorithms  vis-a-vis  special  full 
band-  and  block- solvers  (pages  47-49) 

Implementat ion 

1.  A partitioning  of  the  numeric  but  not  the  symbolic 
phase  is  proposed;  pros  and  cons  of  this  choice  are  discussed 
(pages  72-  73)  . 

2.  Aids  to  equation  formulation,  although  strictly  not 
required  in  sparse  equation  software,  are  proposed  to  avoid 
cumbersome  data  structure  and  structural  ordering  and  over- 
lap problems  (pages  58-69). 

3.  An  interactive  partitioning  scheme,  based  on  user 
specification  of  either  column  breaks  or  maximum  buffer 
storage,  is  proposed  (pages  77-82). 

4.  Examples  - including  Fortran  code  - of  use  of  a soft- 
ware package  on  partitioned  and  unpartitioned  problems  arc 
given  (pages  56.104-105). 


r- 


5.  Timing  results  of  applying  two  new  sparse  solvers  to 
problems  in  electrical  power  systems,  rigid  body  dynamics, 
and  electronic  devices,  and  to  a family  of  finite  element 
problems  are  presented  (Appendix  A) . 
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CHAPTER  1.  SPARSE  MATRIX  METHODS 


A . Introduction 

1.  Historical  review  of  direct  sparse  matrix  algorithms 

Prior  to  1972,  sparse  systems  of  s imulataneous  equations 
tended  to  result  from  the  solution  of  one  of  the  following  classes 
of  equations: 

(a)  partial  differential  equation  (PDE's),  where,  after 
discretization  of  the  spatial  variables,  using  finite  difference 
(FD)  or  finite  element  (FE)  methods,  a regularly- structured  matrix 
was  solved  using  iterative  techniques. 

(b)  ordinary  differential  equations  (ODE's)  or  algebraic 
equations,  where,  after  time  discretization  and/or  linearization,  an 
irregularly  structured  matrix  was  solved  using  direct  methods. 

In  1972,  George  [1]  showed  that,  using  dissection,  a square 
n by  n grid  obtained  from  solution  of  PDE's  by  5-point  discretization 
formulae  could  be  solved  directly  in  O(n^)  time  rather  than  O(n^), 
a prohibitive  cost  associated  with  band- solution  methods.  Later, 

Woo  and  Gustavson  [2]  derived  an  ordering  of  the  grid  points  which 

made  dissection  faster  than  band  methods  for  n greater  than  10. 

2 

Considering  that  iterative  methods  require  0(n  ) time  per  iteration, 
these  results  made  it  feasible  for  the  first  time  to  solve  many 
"small"  PDE  problems  directly,  leaving  iterative  methods  only  for 
large  problems  (perhaps  n greater  than  50)  having  special  structural 
and  numerical  properties.  Indeed,  in  [2]  a variety  of  iterative 
and  direct  solution  methods  are  compared  to  determine  the  value  of 
n for  which  equal  amounts  of  "work"  are  required.  Very  recently, 

Bank  [3]  showed  that  certain  classes  of  FD  problems  can  be  solved 


2 

in  0(n  ) time. 


2.  General  vs.  special  sparsity  algorithms 

« 

When  the  matrix  structure  is  highly  regular  and  the 
number  of  spatial  discretizations  in  the  shortest  dimension  is 
small  (3  15),  easily  constructed  band  elimintation  algorithms 
may  be  fully  justified  computationally.  However, 

(a)  as  the  grid  size  grows  in  a FD  or  FE  problem 
a dissection  strategy  is  called  for; 

(b)  as  structural  irregularity  is  introduced  by 
curved  or  odd-shaped  boundaries,  by  the  use  of  a family  of  finite 
elements,  or  by  implicit  boundary  conditions  resulting  from 

algebraic  or  ordinary  differential  equations  of  a physically-  ••  j 

. 

connected  external  system,  a band  algorithm  becomes  progressively 
less  efficient;  examples  of  the  latter  mixed  PDE/ODE/algebraic 


systems  are  shown  in  Table  1; 


Partial  Dif.  Eqns. 
(regular  sparsity) 

Ordinary  Dif.  and 
Algebraic  Eqns. 
(irregular  sparsity) 

Structures 

Mprhan  i <;m ^ 

Semiconductor 

f.i  rrni  tc; 

Devices 

Combustion 

F.m  i i nn 

Table  1.  Related  applications 

* 

yielding  mixed  sparsity  structures 

(c)  when  a vector  computer  architecture  or  a 
backing  store  is  necessitated  by  a large  problem  size,  specialized 
programming  techniques  may  by  required  to  achieve  acceptable 
solution  efficiency. 

Dissection- related  software  to  implement  (a)  exists  in  a 
variety  of  forms.  Early  programs  required  grid  sizes  related 
to  a power  of  2 and  involved  data  structures  that  require  recoding 
of  the  equation  formulation  step  in  a band-related  solution. 

This  complication  induced  George  in  a later  publication  [4] 
to  consider  less  efficient  dissection  methods  that  retain  some 

of  the  simplicity  of  band  matrix  programming. 

General  sparsity  methods  - where  an  arbitrary  matrix 

structure  is  allowed  - offer  an  alternative  to  both  banded  and 
dissected  grid  solution  methods.  General  sparsity  software  can 
accept  a matrix  formulation  in  any  order,  since  the  equations 
can  be  reordered  internally  by  the  sparsity  software  to  correspond 
to  a banded  or  a dissected  solution.  Usually,  the  principal 
price  extracted  of  the  user  for  this  generality  is  a preprocessing 
step  where  the  matrix  structure  is  analyzed  and  an  efficient 
numerical  solution  algorithm  prepared  specfically  for  that 
structure.  The  introduction  of  additional  structural  irregularity 
for  any  of  the  reasons  cited  in  (b)  above  is  then  at  least 
conceptually  trivial. 

3.  Summary  of  report 

The  above  rationale  for  use  of  general  sparsity  software 
would  apply  to  any  of  a number  of  existing  software  packages 


(3) 


[ 5 ] , [ 6 ] . The  aim  of  this  report  is  to  expand  the  role  of 
such  algorithms  by 

(a)  proposing  alternate  data  structures  that 
improve  the  storage  efficiency  of  existing  software  when  solving 
large  systems  of  equations  associated  with  large  FD  and  FE  problems; 

(b)  including  the  optional  use  of  a backing 
store  when,  because  of  problem  size,  either  real  memory  capacity 
of  a dedicated  system  is  exceeded  or  total  memory  costs  become  a 
significant  factor  in  a virtual  environment: 

(c)  recognizing  local  structure  in  the  matrix 
that  can  be  exploited  by  a vector  processor  and,  incidentally,  can 

improve  solution  times  on  cache  "scalar"  processor  as  well. 

Two  increasingly  sophisticated  software  pacakages  are  des- 
cribed which  permit  the  reader  to  evaluate  the  programming 

effort  involved  in  utilizing  general  sparsity  techniques. 

As  the  problem  size  grows,  backing  store  and  special  architectures 

can  be  brought  into  use  with  no  change  in  the  equation  formulation 
and  a minimum  of  additional  programming  effort. 

B . Sparse  Matrix  Factorization 

1.  Matrix  factorization 


Consider  the  equation 
A x = b 

where  A = [a-.]  „ and  b = [b,.lv1  contain  constant  coefficients 

— ij  nxn,  — ljiiAi 


(4) 


and  x = [ x - - ] , is  a vector  of  unknowns.  We  will  choose  to 

1 J IlA  X 

factor  A into  the  triangular  form 

A = L U 

where  L = a lower  triangular  matrix,  and  U = [ u - - ] , 

an  upper  triangular  matrix,  and  either  = 1 for  1 < i ^ n or 

u^  = 1 for  1 < i < n.  In  general,  L and  U will  contain  the  same 
non-zero  positions  as  A,  plus  "fill"  positions  created  by  the 
elimination  process. 

2.  Software  implementation 

The  use  of  specialized  algorithms  and  data-handling 
methods  for  sparse  equations  is  almost  a decade  old  [7]. 

The  random  structure  is  usually  described  by  a bit  map  or  a 
linked  list.  This  structure  is  then  often  preprocessed  symbolic- 
ally to  reduce  the  computation  in  a subsequent  (repeated) 
numerical  solution  phase  (Figure  1).  The  three  common  two-step 
(symbolic/numeric)  solution  methods  are: 

(a)  code  generation  [ 8 ] , [ 6 ] , where  a large  set 
of  explicit  machine  instructions  - including  array  indices  - 

are  generated  in  the  symbolic  step;  these  instructions  "map"  the 

given  A and  b into  x during  the  numerical  solution  step;  an 

instruction  must  be  generated  for  each  arithmetic  operation,  so 

that  if  a full  matrix  were  being  processed  O(n^)  instructions 

3 

would  be  required  corresponding  to  the  0(n  ) arithmetic  operations; 

(b)  interpretive  index  generation  [ 9 ] , [6]  similar 
to  (a)  except  that  the  code  is  in  the  form  of  array  indices  and 
higher  level  instructions;  for  example,  the  instructions  might 
specify  row/column  operations  such  as  inner  and  outer  product  and 


(5) 


Matrix  Structure 


I 


Pivotl 

Symbolic 

Order|~" 

| Preprocessor 

Machine  code,  or 
interpretable  indices,  or 
L U map 


Figure  1.  Model  symbolic  and  numerical  solution  procedure 

the  indices  would  specify  the  non-zero  positions  of  the  row/column; 
less  storage  is  required  than  (a)  - usually  a factor  of  5 or  10  - 
but  execution  is  commonly  3-5  times  slower; 

(c)  LU  map  approach  [10][12]  where  the  map  of  L and  U 
(i.e.,  A and  its  fills)  is  determined  by  the  preprocessor;  for 
a full  matrix,  only  0(n“)  storage  is  required  for  the  map  of  L 
and  U,  the  same  as  for  the  numerical  values  of  the  matrix  itself; 
this  procedure  appears  to  be  as  fast  as  the  method  of  (b) (see 
Table  A3)  and  for  these  reasons  (b)  is  not  often  used. 

Detailed  speed  and  storage  comparisons  of  (a)  and  (c)  for 
two  FD  examples  are  given  in  [2][ll]. 

In  this  report,  the  LU  map  approach  is  chosen  because  of  its 
reduced  storage  requirements.  The  storage  map  is  further  reduced 
by  compressing  the  data  structure  for  adjacent  non-zero  matrix 
positions,  i.e.,  dense  segments  of  rows/columns.  Thus,  the  storage 
of  the  structure  of  a full  matrix  would  be  0(n),  since  each  row/ 
column  could  be  specified  by  the  address  of  the  first  row/column 
position  and  by  the  row/column  length  (n) . 


(6) 


It  will  be  shown  that  recognition  of  such  structure  allows  increased 
solution  efficiency  for  cache  "scalar"  machines  and  can  be  expected 
to  vastly  improve  the  performance  of  vector  processors.  Besides 
preparing  a map  for  the  numerical  solution,  the  symbolic  preproces- 
sor - far  faster  than  the  numerical  solver  for  large  matrices  - 
is  useful  for  predicting  solution  times  and  for  optimizing  the  auto- 
matic or  interactive  partitioning  of  the  solution  when  a backing 
store  is  used. 

C . Vectorization 

1.  Vectors  and  vector  instructions 

Closely  associated  with  the  condensation  of  the  storage 
map  of  L and  U by  recognition  of  higher  level  structure  is  the 
efficient  processing  of  this  structure  by  exploiting  two  features 
of  modern  scientific  processors  - pipelining  and  parallelism. 

The  following  discussion  of  the  use  of  these  features  will  be 
quite  simplified;  further  descriptions  and  rationales  will  be 
presented  in  Chapter  2. 

We  will  use  the  word  vector  to  mean  an  array  of  data.  Thus, 
a vector  operation  is  an  operation  on  arrays.  This  includes 
row/column  operations  on  matrices,  operations  on  submatrices 
(blocks) , etc . 

For  purposes  of  later  reference,  we  distinguish  between 
(a)  a simple  vector  operation,  to  replace  the 
Fortran-like  operations 

DO  1 M = 1,  N 

1 C C I j (M) ) = A ( 1 2 (M) ) o B(I3(M)) 

(7) 


where  o indicates  an  arithmetic  or  logical  operation  and  where 
I^(M),  I^CM),  and  I ^ CM)  are  Linear  indexing  functions  of  the 
form  I.  = a.  + Mg.  where  a.  and  g.  are  arbitrary;  thus,  with 
o = +,  oti  = 0,  g^  = 1,  the  arrays  A and  B are  added  to  produce 
the  array  C.  If  o = *,  1^  = M,  = 1,  and  1^  = M,  a "broad- 
cast" multiply  (multiplication  of  a vector  by  a scalar)  results. 

(b)  a higher  level  vector  operation  to  replace 
the  triple  (or  double)  loop 


DO 

1 

J = 1, 

Nl 

DO 

1 

K = 1, 

N2 

DO 

1 

L = 1, 

N3 

1 C(I1(J),I2(K),I3(L))  = A(I4(J),I5(K),I6(L))  o B ( I 7 ( J) , 1 8 (K)  , 

I9CI0) 

where  I^(M)  = + Mg^. 

Computationally,  the  important  characteristic  common  to 
(a)  and  (b)  is  that  only  one  vector  "startup"  is  required  in  each 
case.  This  startup  may  include  time  to  determine  Nl,  N2,  N3 
in  a convectional  (scalar)  machine  and/or  to  fill  the  arithmetic 
pipeline  in  a vector  processor.  In  contrast,  the  "operate" 
time  includes  those  operations  that  must  be  repeated  for  each 
pass  through  the  (lowest  level)  loop,  and  will  include  floating 
point  arithmetic,  in  addition  to  array  indexing  and  loop  termination 
test,  depending  on  the  machine  architecture.  In  general,  fewer 
vector  startups  result  in  less  time  devoted  to  overhead  calculations 
and  thus  a higher  overall  efficiency. 


(8) 


f 


2.  Vectorization  efficiency 


This  efficiency  may  be  quantified  as  follows.  The 
computation  time  to  perform  the  i^*1  vector  operation  is 


T.  = T + r • T 
1 Si  1 opi 


(5) 


where  Ts  is  the  startup  time,  T^  the  operate  time,  and  r^ 

the  vector  length.  If  T and  T are  independent  of  i,  then 

si  opi 

the  time  to  perform  m vector  operations  is 


m 

T = mT  + T Z r. 
s °Pi  = l 1 


Since  the  operate  time  TQp  is  the  useful  computation  time,  define 
the  vectorization  efficiency  as 


operate  time 


startup  time  + operate  time 


m 

< VTs><  ^/"O 

m 

lt(Top/Ts)(.E1ri/m’ 
r 1 = 1 


(6) 


m 


I 


Although  T /T  is  a machine  parameter,  the  quantity  . E,r./m  = L 

op  s i=l  i ave 

is  problem-dependent,  and  identifiable  as  the  "average  vector 
length."  Note  that  an  efficiency  of  .5  is  achieved  when  the 
average  vector  length  is  equal  to  the  ratio  Ts/T0p. 

Although  (6)  usually  applies  to  only  a single  class  of 


n 


(V 


I-  A 


'>*'  - ****  «■« 


vector  instructions  (since  T is  assumed  constant)  similar 

opi 

formulae  can  be  derived  from  (5)  when  several  classes  of  vector 
instructions  are  involved-  In  the  factorization  problem,  we 
will  show  that  the  inner  loop  consists  of  a vector  multiply 
and  a vector  subtract,  each  of  the  same  length.  It  is  easily 
demonstrated  that  the  efficiency  obtained  from  (5)  is  then  the 
same  as  for  a single  instruction  with  startup  and  operate  times 
equal  to  the  sum  of  startup  and  operate  times  for  a subtract  and 
multiply.  The  average  vector  length  of  the  single  time- equivalent 


instruction  is  therefore  a useful  concept. 


CHAPTER  2. 


VECTOR I I HD  CENTRAL  SPARSITY  ALGORITHMS 


For  purposes  of  this  discussion,  assume  that  the  equations 

represented  by  A x = h^  cannot  be  locally  decoupled,  so  that  only 

single  rows/columns  may  be  pivoted  upon  at  a time  (see  [13]  for 

multi-row  elimination).  Associated  with  an  n x n matrix,  n 

t h 

pivot  steps  can  be  identified,  each  (rn)  step  involving  a 
division  of  the  r^  row  or  column  by  the  pivot  element  and  a 
sequence  of  multiply- subtract  operations  involving  the  r^ 
row  or  column  and  other  rows  or  columns  of  the  matrix. 


Although  we  will  have  reason  later  to  study  a variety  of 
factorization  algorithms  that  involve  backing  store  and  arch- 
itectural issues,  for  the  present  two  rather  conventional 
procedures  will  be  compared. 

a).  Row-ordering 


where  the  subscripts  on  the  vectors  correspond  to  beginning 

and  ending  column  numbers,  and  where  u^  = a , = a 

& r,m  r,m  ’ r,m  r,m 

The  row-wise  factorization  step  is  described  by 


1 ( r- 1 ) 

-r+l,n  ~r^.  l^r+l ,n 


l,  _ = l 


(r-1) 


— 1 » r — 1 ,r 


The  forward  and  back  substitution  steps  require  the  solution 
of  L Z = b,  U x = x-  Using  the  same  vector  notation  as  (78) 


yj  ■ 


■ (br  • <-<l, r-d  • 


xn  ‘ 


Lr  ^r  ^-r+l,n-*  ' — r+l,n 


The  row  ordering  of  L and  U requires  the  inner  product  of  two 


vectors . 


b) . Column- ordering 


u<k>  - M10 

-J  >r  J ,r 


-r+l,n  r+l,r 


where  u^  = a , = a . The  columnwise  factorization 

m , r m , r ’ m , r m,r 


uoo  ru(k-D 

— k+1 , r I— k+1 , r 


/(k)  p(k-l) 

-r+l,n  -r+l,n 


‘ uk,r  -k+1 ,n 


k = 1,2, ...r-1 


hi 


t = 1 u(r'1] 

-r+1  ,n  -i-  —r+1  ,n 

rr 


(r-1) 


— 1 , r -1 , r 


and  the  forward  and  back  substitution  step  is 


ii°tl  = - 

y(r)  . y(r-D 

*-r+l,n  Ar+l,n 


^r  — r+l,n 


r = 1,2, ...n-1 


x(n)  = y(n-l) 

-1  , n -1 , n 

x = x(n)/u 
xn  n /unn 


K ( r)  _ „(**•*■  1 ) 
-1 , r — 1 , r 


V * 4r>/urr 


x U,  ^ 

r -1 , r- 1 


r = n- 1 , . . . 1 


(14) 


The  choice  of  row-  or  column-ordered  algorithm  can  be  significant 
for  a scalar  processor  (to  be  shown  experimentally  later).  For 
a vector  processor  having  an  efficiently- implemented  inner-product 
instruction,  row  ordering  is  preferred;  however,  a processor 
with  chained  multiply-add  arithmetic  units  such  as  the  Cray-1 
clearly  suggests  use  of  column-ordering.  The  reader  will 
recognize,  however,  that  only  the  forward  and  back  substitutions 
need  be  changed  to  accommodate  row-ordering  since  = 1 and 

1 = 1 in  the  row-  and  column-ordered  methods,  respectively. 

These  options  are  available  in  subroutines  (VMBPR,  ZMBPR)  and 
(VMBPC,  ZMBPC)  respectively  in  the  software  package  of  Chapter  3. 


(13) 


(13) 


B . The  LU  Map  Approach  and  Its  Symbolic  Vector i zat ion 


1 . 1 n t roduc t ion 

The  purpose  of  the  symbolic  phase  of  figure  1 is  to  determine 
the  fill  characteristics  of  A,  i.e.,  the  exact  structure  of  L and 
Li.  This  information  is  used  by  the  numeric  part  to  reduce  the 
solution  time  (Gustavson  [10]  cites  a factor  of  2-3  for  the 
LU  map  approach) . 

To  acquaint  the  reader  with  this  approach  an  example  using 
Gustavson' s "scalar"  map  is  shown  in  Table  2.  Special  note 
should  be  taken  of 

(1)  the  fill  positions  detected  by  the  symbolic  phase  in  the 
generation  of  the  LU  map; 

(2)  the  use  of  map  indices  in  the  numeric  solution  to 
extract  information  from  the  numeric  arrays  A,  L,  and  U; 

(3)  the  use  of  an  expanded  current  column  (X  array),  requiring 
zeroing,  expansion,  and  contraction  in  the  loading  and  storing 
process  (see  [10]  and  page  28  of  this  report  for  alternative  pro- 
cedures) ; 

(4)  the  opportunities  for  the  use  of  (simple)  vector  opera- 
tions in  the  numeric  solution,  as  evidenced  by  the  indexed  array 
operations  marked  "vector". 

It  should  be  pointed  out  that  fill  detection  is  essential  to 
any  sparse  matrix  factorization  algorithm.  The  LU  map  approacli 
exploits  the  fact  that  this  costly  process  need  be  performed  only 
once  for  a given  matrix  structure,  allowing  multiple  solutions 
with  different  numerical  values  - as  occurs  in  a Newton  lineari- 


zation process. 


1 


3 0 0 0 2 


0 4 2 10 


0 2 fa  0 5 


2 0 5 1 5 


0 1/, 
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2 7/10 1 13/10 
13/27  | 67/54 


matrix 


current 

column 

completely- factored  matrix 


from  ( JA 
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rated 

by 

sym- 

bolic 


gene-  L 
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by  I 
nume-  5 U 
ric 

f a c t o - j 
riza-  ij'I 
t ion 


(column-ordered  numeric  values  of  A matrix) 

3,2,4, 2, 1,2, 6, 3, 1,3, 1,2, 3, 1,5 

( JA ( j ) points  to  beginning  of  jth  column  of  A in  1A) 
1,3,6,  >?j,12,16 

(column-ordered  Jiust  of  row  numbers  of  A) 

1 , 5 , 2 , 3 , 4 , !~2  ^5j,  2 , 4 , 5 , 1 , 3 , 4 , 5 

(JL(j)  point_s  to  beginning  of  jth  column  of  L in  IL) 

1 , [ 2 , 4,6J  , 7 

(column^- ordered  list  of  row  numbers  of  L) 

3 9 j_3  9 ^ f j 1 1 

(JU(j)  points  to  beginning  of  jth  column  of  U in  IU) 
1,1, 1,2, 4, 7 

(column-ordered  list  of  row  numbers  of  U) 

2 , >2  ,a.j  , 1 , 3,4 £in 

(column-ordered  numeric  values  of  L) 

2/3, 1/2, 1/4, -1/10, 3/5, 

(column-ordered  numeric  values  of  U) 

2 ,  , »_* 4 

( ordered  nur,,pric  v lues  of  diagonal) 

3,4,5,  , 


(a)  Example  up  to  factorization  of  fourth  column 
fable  2 . Example  of  use  of  LU  map  in  factorization 


1.  Zero  expanded  current  column  (X  array) 

2.  Load  current  column  with  fourth  column  of  A 


X ( 2 ) = 1 

indices  I 

X ( 4 ) = 3” 
X (5)  =1_ 

vector 

from  I A 


i 


5. 


Factorize  fourth  column 


X(3)=X(3)-X(2)*L(2)=0-(1)  (l/2)  = -l/2l 
X(4)=X(4)-X(2)*L(3)=3-(l)(l/4)=ll/4j 


indices 
from  IL 
of  pre- 
vious 
col umn s 


X(4)=X(4)-X(3)*L(4)=ll/4-(-l/2)(-l/10)=2 

X(5)=X(5)-X(3)*L(5)=l-(-l/2)(3/5)=13/10 


starting 
indices 
from  JL 


7/10] 

j 


vector 


UI (4)=l/X(4)=10/27 
X(5)=X(5)*DI(4)=13/27 


4.  Store  current  column 


r 

I 

starting 
indices 
from  JL,JU 


U(2)=X(2)] 

U(5)=X(3)j 

L(6)=X(5) 


vector 


indices  from 
ilL.lU  of 
icurrent  column 


(b)  Steps  in  Factorization  of  Fourth  Column 
lable  2.  Example  of  use  of  Id—  map  in  factorization 


(16) 


■ 


The  details  of  the  symbolic  map  generation  are  1 
Chapter  3.  However,  the  following  two  sections  arc 
to  give  insight  into  this  process  by  discussion  oi  \ 
data  structure  and  symbolic  operations  on  it  dui ing 
zation  process. 


eft  for 
intended 
'ector  i zed 
the  factor i 


(17] 


2.  Vectorized  list  data  structures 

Consider  a column  of  a sparse  matrix  having  the  non-zero 

row  positions  shown  in  figure  2 (before  fill).  This  structure  would 
be  described  in  a conventional  ordered  list  as 

5 1 , 5 2 , . . . 5b  , 39 , 4 2 , 45  , . . . 47  (1  5) 

Such  a list  enumerating  all  row  positions  will  be  termed  scalar 
storage.  Clearly,  the  list  can  be  shortened  by  identifying  sets  of 
contiguous  positions  (vectors)  and  retaining  only  the  first  and  last 
row  numbers,  viz, 

31,36,59,39,42,47  (16) 

This  form  is  natural  to  looping  operations  for  a scalar  processor, 
where  pairs  of  numbers  are  directly  usable  as  upper  and  lower  loop 
indices.  Alternatively,  the  initial  row  position  and  the  vector 
length  could  be  stored  as 

51,6,59,1,42,6  (17) 

This  form  is  favored  by  vector  processors  with  hardware  that  counts 
down  vector  arithmetic  operations  to  terminate  a vector  operation. 

Another  choice,  preferred  when  a significant  number  of  single- 
ton  (scalar)  positions  are  present,  represents  a vector  of  length 
one  with  a minus  sign  prefixing  the  row  number  as 

31,36,-39,42,47  (18) 

This  latter  structure  has  been  adopted  in  this  report. 


(18) 


Vector  fills 


The  mul  t iply-subtract  operation  of  (.12)  can  result  in  pro- 
duction of  fills  that  must  be  detected  in  the  symbolic  phase.  In 
Figure  2,  the  process  of  multiplying  the  kl  column  of  L (termed 

a preceeding  or  recalled  column)  by  u,,  _ and  subtracting  from  the 

k , r 


Before  Fill 


r 


th 


col umn 


k 


th 


column 


(36,43) 
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II 
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31 
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35 

36 

1 1 
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I * 
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43 

44 

45 

46 

47 

(31,36) 


- 39 


(42,47) 


After  Fill 


(31,47) 


31 
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33 

34 

35 

36 

37 

38 

39 
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41 

42 

43 

44 

45 

46 

47 

Figure  2.  Example  of  vector  fill,  with  data  structure 
description  (scalar  indicated  by  - sign) 

f-  v» 

r column  of  L (termed  the  current  column)  and  U is  depicted.  The 
zero-valued  positions  37,38,40,41,  which  initially  separate  two 
vectors  and  a scalar,  are  filled  by  the  dense  vector  (36,43)  in 
the  k^  column. 

The  symbolic  phase  produces  the  LU  map  by  scanning  the  number 
pairs  representing  the  vector  structure  of  all  the  preceeding 
columns  and  the  current  column  to  determine  zero-valued  regions  of 


(19) 


the  latter  covered  by  at  least  one  of  the  former.  These  are  the 
fill  positions. 


Comparison  of  Scalar  and  Vector  Sparsity  Methods 


1.  Introduction 

Although  one  could  expect  to  reduce  storage  by  compacting 
vectors  in  the  data  structure  and  could  hope  to  exploit  the 
structure  of  machines  architected  to  efficiently  process 

looped  instructions,  it  is  less  obvious  that  conventional 
(scalar)  machine  performance  would  benefit  from  vectorization. 

In  this  section  we  study  both  the  storage  and  the  speed 
issues  in  some  detail,  showing  quantitatively  the  advantage  of 
vectorization  even  for  one  of  the  most  recent  scalar  processors. 

2.  Storage 

In  Gustavson' s LU  map  approach,  the  column  ordered  map  of 
L and  U is  saved  in  arrays  IL  and  IU.  Thus,  for  every  numerical 
value  in  L and  U,  there  is  a symbolic  value  in  IL  or  IU  (Table  2) . 

Now  consider  an  LU  map  with  m vectors  of  average  length  d. 

The  scalar  and  vector  maps  require  md  and  2m  locations  respectively, 
exclusive  of  singletons.  Now  consider  the  symbolic  (2-byte)  and 
numeric  (8-byte)  storage.  Define 

vectorized  storage 

storage  factor  = 

scalar  storage 

which  becomes 

2 ( 2m)  + 8 (md) 

storage  factor  = 

2(md)  + 8(md) 


.4 

= .8  + 

d 

Thus,  a vectorized  map  will  result  in  a 20%  storage  savings  as 


3.  Speed:  symbolic 

A speed  improvement  is  possible  in  both  the  symbolic  and 
numeric  processing  stages. 

Consider  again  in  Figure  2 the  symbolic  process  of  creating 
a vector  fill.  Since  each  vector  is  described  by  a pair  (beginning 
row  position,  ending  row  position) , the  length  of  the  data 
structure  does  not  depend  on  the  vector  lengths.  Similarly,  the 
operations  performed  on  this  structure  to  determine  fill  are  not 
changed  if,  for  example,  all  vector  lengths  are  doubled.  Obviously, 
this  is  not  true  for  the  scalar  approach,  where  each  row  position 
must  be  examined  separately  for  fill. 

In  Table  A3  of  the  appendix,  symbolic  solution  times  are 
given  for  both  scalar  and  vector  versions,  each  processing  a 
family  of  finite  element  matrices.  For  matrices  of  dimension 
9 and  49,  the  scalar  version  is  faster;  the  vector  version  is 
nearly  three  times  the  speed  of  the  scalar  for  a matrix  of 
dimension  961. 

4.  Speed:  numeric 

A more  important  comparison  of  the  scalar  and  vector  approaches 
involves  the  repeated  numerical  processing  stage.  Although 
computer  architecture  characteristics  - particularly  the  extent 
of  instruction  and  operand  pipelining  - are  quite  influential 
here,  nonetheless  it  is  possible  to  construct  a simplified  model 
of  a typical  vectorizeable  operation  and  discover  the  conditions 
in  which  the  scalar  approach  may  be  preferable. 
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th 
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Figure  3.  Subtraction  of  packed  (kth)  column  from  expanded  (rth) 
current  column 


The  most  time-consuming  part  of  the  factorization  algorithm 
is  the  multiplication  and  subtraction  step  of  (12).  The  numerical 
values  of  the  k1"^  column  are  assumed  packed  as  illustrated  in 
Figure  3.  Multiplication  by  u,  can  therefore  be  performed  in 

K , I 

a loop  or  vector  mode  without  reference  to  the  row  positions. 

However,  the  subtraction  of  (12)  involves  knowledge  of  the  row 

positions;  this  is  shown  in  Figure  3,  where  three  segments 
t h 

of  the  kL  column  are  depicted  being  subtracted  from  the  expanded 
t h 

r column.  This  in  turn  requires  that  the  row  position  must  be 
addressed  either  one  at  a time  - one  for  each  numerical  value  - 
or  in  a vector  sense  - the  beginning  and  ending  row  address  of 
a dense  column  segment.  A Fortran  implementation  of  this  sub- 
traction process  for  a column  in  the  numeric  phase  would  be  of  the 
form 


DO  2 J=N1 , N 2 
K= I L ( J) 

2 X(K) =X(K)  - Y ( J+  L) 


Scalar 


N3= I L (Nl) 
N4=IL(N1+1) 

DO  2 J=N3 ,N4 

2 X(J)=X(J)  - Y ( J+L) 

Vector 


(22) 


Here,  the  current  column  (X  array)  is  addressed  through  the  IL 
array  in  the  scalar  case;  in  the  vector  case,  IL  contains  the 
beginning  and  ending  row  numbers  of  a dense  segment  of  the 
column . 

Clearly,  the  vector  algorithm  requires  more  startup  time  - 
to  compute  N3  and  N4  - but  less  loop  execution  time  than  the 
scalar  case,  which  involves  indirect  addressing. 

It  happens  that  these  simplified  loops  are  not  representative 
of  either  the  factorization  or  the  substitution  steps.  A more 
extended  test  is  given  by  the  loop  timing  program  of  Table  3, 
where  vectorized  and  scalar  versions  of  both  inner  product  and 
broadcast  multiply- subtract  implementations  of  the  inner  loop  are 
timed.  The  inner  product  is  usually  associated  with  a row-ordered 
substitution  process  (see  Section  A)  whereas  the  broadcast  operation 
is  used  in  row-ordered  factorization,  column-ordered  factorization, 
and  column- ordered  substitution.  All  loop  operations  may  be 
characterized  by  a loop  startup  (Tg)  and  a loop  execution  (T  ) 
in  the  manner  of  Eq.  (5);  experimental  results  are  given  in  Table  4 
in  terms  of  these  parameters. 


Loop  description 

Startup 

(ysec) 

Execution 

(ysec) 

Vector 

inner  product 

2.6 

1.6 

Scalar 

inner  product 

1. 

1.7 

Vector 

broadcast -sub tract 

2.8 

2.24 

Scalar 

broadcast- subtract 

. 72 

2.42 

Table  4.  Loop  timing  results;  Amdahl  470V/6 , Fortran  H,  double  precision 
Comparing  the  two  broadcast  loops,  the  vectorized  version  has  a 
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C-—  TIMING  TEST  FOR  INNER  LOOPS  OF  ALTERNATE  FACTORIZATION 
C AND  SUBSTITUTION  ALGORITHMS 

IMPLICIT  REAL«8(A-M*0-Z) 

01  MENS  ION  X(2<000) • 1(26000)*  JC (50000) 

00  12  J»1*2«000 
X( J)-1*D-50 
12  T(J)-X(J) 

6 00  1 >1*6000 

1 JC(J)-J 
XA»  1 • 

REA0(6*22)LVECT 
NT  IMES-50000/LVECT 
MR  ITE(6*22)LV£CT  *NT IMES 
22  FORMAT (215) 

L>6 
NT*  1 

N2-LVECT 

C»*»*  TEST  FOR  BUSTAVSON  FACTORIZATION  (SCALAR) 

CALL  TIME(R) 

00  3 M«1 tNT I MES 
00  2 J-N1 • N2 
K-JC( J) 

2 X(K)-XOO-XA-T(J»L) 

3 CONTINUE 

CALL  T I ME  ( 1 • 1 ) 

C»  — TEST  FOR  BUSTAVSON  F.  ( B*  SUB*  (SCALAR) 

SUM»0*O0 
CALL  T I ME ( 0 > 

00  7 M-1 *NT IMES 
DO  B J-N1*N2 
K«JC(  J) 

9 S UM-S  UN- X ( J )*  V ( K ) 

7 CONTINUE 

CALL  1 INE( 1* 1) 

N1>0 

DO  11  J-1*50300>2 
JC( J)«URANO(  IN  IT )* 1000*1 
11  JC(J*1)»LVECT*JC(J)-1 

NS-100000 

C«—  TEST  FOR  VECTORIZED  FACT*  ANO  SUBSTITUTION 
CALL  TIME(0> 

00  5 N-1*NTIMES 
I VT  1-0 

• N1-N1-1 

N3- JC(N  1 ) 

N1-N1*1 

N4«JC(N1) 

JD  IFF* IVT I-N3-1 
00  4 J«N3*N* 

4 X( J)«X( J)*XA»Y( J»JO IFF) 

IVT I- JDIFF-N4 

IF(N1 *GT*NS)60  TO  0 

5 CONTINUE 

CALL  T I N£ ( 1 • 1 ) 

N1»0 

C— • TEST  FOR  VECTORIZED  INNER  PRODUCT 
SUM-0.00 
CALL  TINE(0) 

00  IS  N-1.NTIME5 
IVTI-0 

It  N1-N1-1 

M3-JC(N1) 

Nt-N1*1 
N4«JC(N  1) 

JOIFF-IVT l-N3*1 
00  14  J-N3.N4 

14  SUM-SUM- XA-VOUO IFF) 

IVT I- J0lfF*N4 

IF (N 1 *8T *NS )00  TO  » 

15  CONTINUE 

CALL  T I ME( 1*1) 

00  TO  < 

ENO 


Table  3.  Timing  program  for  inner  loops 
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larger  startup  but  a smaller  loop  execution  time.  The  total 
loop  timings  become  equal  when  ( 2.  is  the  loop  length) 

.72  + 2.422  = 2.8  + 2.242 

or  2 = 11.  Thus,  a scalar  processor  is  found  to  operate  pre- 
ferentially on  vectors;  as  1 + «>  the  loop  execution  time  of  the 
vectorized  version  is  2.24/2.42  = .92  the  time  of  the  scalar. 

These  timings  will  be  used  throughout  the  report  to  estimate 
the  computation  times  for  large  problems.  To  illustrate  this  use, 
consider  the  problem  of  estimating  the  relative  times  of  scalar 
and  vector  factorization  of  the  961  x 961  matrix  of  Table  A3  using 
broadcast  inner  loops.  The  average  vector  length  (L  e)  of  this 
loop  is  taken  from  Table  A2  as  6.85  elements.  Therefore, 

scalar  loop  time  .72  + 2.42(6.85) 

= = .95 

vector  loop  time  2.8  + 2.24(6.85) 

This  compares  with  a fraction  for  the  factorization  step  from  Table 

A3  of  1146/1309  = .88.  Both  ratios  decrease  for  the  smaller  matrices 
in  the  finite  element  family  studied,  as  L becomes  smaller  and  the 

a.  V c 

vector  loop  startup  becomes  more  significant. 

D . Symbolic  vs.  Numeric  Speed 

As  the  average  vector  length  increases,  the  symbolic  processing 
time  should  decrease  relative  to  the  numeric  time.  To  quantify  this 
concept,  consider  a recalled  column  of  L with  m vectors  of  average 
length  Lsub  being  multiplied  by  a scalar  and  subtracted  form  the 
current  column.  The  symbolic  processing  can  be  expected  to  be 
proportional  to  m and  the  numeric  proportional  to  mL  ^ Therefore, 

Nfnumeric  processing  time) 

= K Lsub  (19] 

S(symbolic  processing  time) 


(25) 


This  simplified  analysis  yields  surprisingly  consistent  experi- 


mental results.  The  Lsu^  is  the  average  vector  length  of  the  subtract 
operation  in  the  inner  loop,  a quantity  that  can  be  measured  experi- 
mentally and  calculated  precisely  for  the  dissected  finite  element 
grid.  In  Table  5,  the  K of  (19)  is  shown  for  all  the  sparse  matrices 
studied.  For  the  wide  range  of  matrix  size  and  structure, 

.24  < K < .31. 

Using  K = .25,  one  can  estimate  from  Table  A3  the  N/S  ratio  for  the 
general  finite  element  grid  as 

N ( . 1 3) 2n 

S n-2.96 

giving  for  a 126  x 126  grid  N/S  = 4.  Since  the  symbolic  phase 
involves  mainly  comparison  operations  with  ordered  pairs  of  numbers 
(to  establish  fill  regions)  it  can  not  be  expected  to  be  itself 
vectorizeable.  Thus  the  above  value  for  K would  be  much  smaller  if 
both  phases  were  run  on  a vector  machine.  The  symbolic  phase  - 
viewed  as  a vectorizat ion  step  - would  most  efficiently  be  executed 
on  a scalar  processor,  and  the  numeric  phase  on  a vector  processor. 

As  the  N/S  ratio  becomes  larger,  the  symbolic  step  becomes 
useful  as  a simulation  tool.  The  number,  length,  and  type  of 
vector  operations  can  be  determined  in  the  symbolic  and,  knowing 
the  functional  characteristics  of  a processor,  computation  times 
can  be  predicted.  This  is  especially  useful  for  estimating  solution 
times  of  vector  processors  that  are  not  conveniently  available. 

• Such  a simulation  program  has  veen  devised  and  has  been  useful  in 

obtaining  precise  vector- related  structural  information  such  as 
displayed  in  Tables  A2  and  A5. 


Dissected  Rect.  Grid 
Matrix  (grid)  size 

N/S 

^sub 

K = (N/S) /L 

9 x 9 (2  x 2) 

. 53 

1.97 

. 27 

49  x 49  (6  x 6) 

.69 

2.92 

. 24 

225  x 225  (14  x 14) 

1.12 

4.42 

.25 

961  x 961  (30  x 30) 

1.  7 

6.  85 

. 25 

Electric  Power  Problem 

.676 

2.6 

. 26 

Aircraft  Landing  System 

. 500 

1.6 

. 31 

Electronic  Device  Model 

4.  76 

16.  3 

. 29 

Table  5.  Experimental  determination  of  K in  Equation  19,  for  an 
Amdahl  470  V/6,  double  precision,  Fortran  H. 


E . Inner  Loop  Considerations 
1.  Introduction 

For  large  matrices,  the  multiply- subtract  inner  loop  becomes 
the  dominant  computation.  To  illustrate,  inner  loop  operations 
for  the  factorization  of  the  largest  finite  element  matrix  of 
Appendix  A involves  59026  inner  loop  startups  and  404587  loop 
executions.  From  the  benchmark  of  Table  4,  the  inner  loop  time 
can  be  estimated  as 

T inner  = (59026) (2. 8)  + (404587) ( 2 . 24) 

= 1.07  x 10^  ysec. 

This  is  81  percent  of  the  measured  factorization  time  of  1.309 
sec.  of  Table  A3.  Since  this  estimate  accounts  for  the  time  devoted 
to  the  multiplication  of  a scalar  from  U by  a column  from  L,  the 
percentage  of  time  devoted  to  the  inner  loop  computation  will 
increase  with  the  column  density  of  L;  from  Table  A1 , this  is  a 
logarithmically- increasing  function  of  grid  size. 
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In  the  following  sections,  issues  related  to  the  implementation 
of  this  loop  are  discussed;  the  scalar-vector  comparisons  have 
been  made  in  section  C4. 

2.  Expanded  vs.  packed  list  structures 

The  inner  loop  of  the  factorization  involves  a multiply  of 
a packed  column  of  L by  a scalar  and  a subtraction  of  the  result 
from  the  current  column  being  factored.  The  multiply  can  be  carried 
out  in  one  vector  multiply;  the  subtraction  must  be  performed  with 
attention  to  row  numbers  of  the  minuend  and  subtrahend.  This 
requirement  suggests  at  least  two  procedures  based  on  the  data 
structure  of  the  current  column. 

1)  If  the  current  column  is  fully  expanded,  subtraction  can  take 
place  using  the  address  of  the  initial  row  position  of  the  current 
column  as  a reference;  as  illustrated  in  Figure  4a,  a vector  in  the 
packed  k column  can  be  subtracted  from  the  r column  with  only 

a single  address  offset  calculation.  The  current  column  must  be 
expanded  from  the  initially  packed  matrix  and  must  be  compacted 
after  the  last  subtraction,  usually  a trivial  0(n)  process.  Also, 
the  storage  demanded  by  this  expansion  process  is  usually  not  a 
problem  even  with  widely- separated  elements  or  blocks  of  elements. 

2)  If  the  current  column  is  packed  with  an  accompanying  vectorized 
list  description  of  the  row  numbers  of  L and  U,  then  subtraction 
involves  searching  the  list  until  the  row  number(s)  of  the  minuend 

are  in  the  range  of  those  of  the  subtrahend;  in  Figure  4b,  for  example, 
the  current  list  would  be  scanned  for  subtraction  of  the  scalar  in 
row  7 and  the  vector  in  rows  16-18;  the  locations  of  the  numerical 
values  in  the  packed  current  vector  would  then  be  calculated  for 
the  vector  in  rows  16-18  as  (for  example) 


packed 

kth 

column 


expanded 
r th 

column 


zero-varied  positions  of  LU 


QD  } 


all  offsets  calculated 
from  first  column  position 


(a)  Expanded  current  column  of  LU 


packed 
k th 

column 


packed 

rth 

strip 


9 th  position 


20  |42 

l9|52[^r 

r|i8  5lUo| 

4ll7;5°i39l 

[iieTiadoi 
TF[l7j35 
13  I6|34 
j!2  15 133 
l I 14132 
10  13131 

'{  7 l2|i0 
6 lOl  6 


-7  ,16,18 


column  breck 

i n r 

5,7,10,13,15,20,9,10,12,18,  etc 
skipped 


offset  offset 

calcul  :ted  calculated 
from  5 from  15 


(b)  Packed  current  column  of  LU 


Figure  4.  Illustration  of  subtraction  operations  on 
expanded,  packed  list  structures. 
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[ ( 7 - 5)  + 1 ] + [(13-10) + 1]  + [ ( 1 6 - 1 5 ) + 1 1 = 9 
denoted  on  Figure  4b.  Thus,  a running  total  of  the  offset  of 
each  vector  of  the  current  column  must  be  maintained  as  the  list 
is  scanned. 

The  scanning  and  address  calculation  is  clearly  overhead  which 
must  be  small  relative  to  the  subtraction  time  itself.  Thus, 
the  number  of  vectors  represented  by  the  list  length  - must  be 
small  and  their  size  must  be  large.  This  is  in  fact  a characteristic 
of  matrices  arising  from  large  FD  and  FE  problems,  where  each 
column  will  have  on  the  average  4-6  vectors  with  an  average  length 
of  30  or  more  numerical  values  . Smaller  FD  or  FE  problems  or 
matrices  not  having  these  characteristics  are  more  efficiently 
solved  with  the  expanded  storage  of  (1).  For  these  reasons,  the 
programs  to  be  later  described  were  written  using  both  procedures: 
the  program  using  only  local  store  for  small  matrices  will  have 
expanded  column  storage;  the  program  allowing  a backing  store  will 
use  a packed  current  column  throughout.  The  third  data  column  of 
Table  A3  and  the  seventh  column  of  Table  A4  yield  a comparison  of 
timings  resulting  from  expanded  and  packed  current  column  storage; 
e.g.,  for  a 961  x 961  matrix,  factorization  requ  res  1309  msec  and 
1667  msec,  respectively.  The  relative  speeds  as  a function  of  n 
show  a fractional  difference  decreasing  toward  1 between  the  two 
timings,  the  above  ratio  being  only  1.22.  Undoubtedly,  the  increasing 
Lave  with  problem  size  is  primarily  responsible  for  decreasing 
overhead  in  traversing  the  packed  column  lists. 

3.  Sequencing  of  the  multiply- subtract  operations 

Another  inner  loop  issue  introduced  by  vectorized  solution  is 
whether  the  multiplication  of  a column  of  L will  be  performed  in  a 
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single  vector  operation  or  whether  the  multiplication  and  subtraction 
will  be  performed  within  the  same  loop.  The  latter  has  the  advantage 
in  a scalar  processor  that  fewer  loads  and  stores  are  necessitated, 
since  the  result  of  the  multiplication  is  available  either  in  a 
register  or  in  cache  memory  for  the  following  subtraction.  However, 
fewer  vector  multiplies  (i.e.,  startups)  are  required  for  the  former 
process.  Thus  both  options  should  be  available  in  a general  program. 
Note  that  this  consideration  appears  in  the  forward  and  back  sub- 
stitution as  well.  The  third  data  column  of  Table  A3  shows  the 
significant  impact  in  a cache  machine  of  combined  multiply-subtract 
operations  (the  number  in  parenthesis  is  the  timing  with  separate 
operations).  For  a 961  x 961  matrix,  the  ratio  is  1773/1309  = 1.35 
for  the  factorization  and  156/113  = 1.38  for  the  forward  and  back 
substitution. 

4.  Assembly- language  programming 

The  speed  of  an  algorithm  is  related  to  the  language  in  which 
it  is  programmed.  Thus,  it  is  important  to  document  the  extent 
to  which  an  algorithm  is  influenced  by  programming  at  least  the 
inner  loop(s)  in  a high  level  language  such  as  Fortran. 

An  assembly  language  version  of  the  expanded  column  factorization 
algorithm  of  Figure  4a  was  written,  with  the  inner  two  loops 
(k=l , 2,  . . . r-1)  of  Eq.  (12)  coded  in  assembly  language.  Table  A3 
gives  timing  comparisons  vis-a-vis  a Fortran  H implementation. 

For  a 961  x 961  finite  element  matrix,  the  ratio 

Fortran  H timing  1309 

= = i.4i 

Assembly  timing  922 

shows  a significant  but  not  atypical  advantage  in  machine-dependent 
coding.  This  depencence  on  the  quality  of  Fortran  code  must  be 
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minimum  will 


in  general  be  Quite  hrn.-irl 
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borne  in  mind  when  other  timing  comparisons  are  made  in  this  report. 

F • Port  it ioning 

1 . Introduction 

Partitioning  the  solution  of  simultaneous  linear  equations 
refers  to  the  division  by  the  user  of  matrix  and/or  LU  storage 
between  a local  store  (either  real  or  virtual)  and  a slower  backing 
store  accessible  to  the  ALU  only  through  the  local  store.  The 
local  store  contains  that  part  of  the  matrix  on  which  computation 
is  being  performed,  so  that  read/write  (I/O)  operations  are 
necessary  between  local  and  backing  store  to  carry  out  the  complete 
matrix  factorization.  The  purpose  of  this  partitioning  is  either 
(1)  to  meet  the  storage  limitations  imposed  by  a real  memory  too 
small  to  contain  the  complete  factored  matrix,  or  (2)  to  minimize 
the  cost  of  computing  in  a real  or  virtual  system,  where  costs 
depend  on  both  storage  used  and  user-prescribed  I/O  operations.  The 
sparse  matrix  software  to  be  described  in  the  next  chapter  has 
considerable  flexibility  in  performing  this  partitioning  even 
before  the  matrix  is  numerically  solved.  Wise  use  of  this  flexibility 
requires  modeling  both  the  computing  system  and  the  matrix  structure. 
The  purpose  of  the  following  discussion  is  to  provide  insight  into 
the  issues  involved  and,  by  example,  specific  guidelines  for  the 
user  of  the  software. 

Before  proceeding,  it  is  worth  noting  that  many  of  the  concepts 
and  even  some  of  the  detailed  analysis  to  be  presented  applies  to 
other  memory  heirarchies  such  as  register-cache,  register-main,  and 
cache-main,  where  a small  fast  memory  communicates  with  a large  slow 


memorv. 


2 . Fixed  local  store 
a.  Introduction 

The  simplest  model  for  analysis  is  that  of  a processor  with 
fixed  local  store.  In  this  environment,  common  to  large  scientific 
processors,  the  I/O  operations  are  managed  by  either  (a)  the 
central  processor,  or  (b)  a peripheral  processor  sharing  the  local 
store  with  the  central  processor.  In  the  first  case,  the  total  CPU 
time  would  be  the  dominant  issue;  in  the  latter,  the  ratio 
6 = (I/O  time)/(CPU  time)  would  yield  the  fraction  of  time  that 
the  I/O  processor  is  busy,  so  that  when  6 > 1 the  CPU  must  wait 
on  the  I/O  processor  for  its  operand  supply.  This  section  will 
concentrate  on  analysis  of  the  6 ratio,  since  the  former  will  be 
a special  case  of  variable  storage  time  minimization  (b=0)  considered 
in  the  next  section. 

Clearly,  both  the  CPU  time  and  storage  depend  on  matrix  structure 
so  that  an  analysis  must  be  aimed  either  at  obtaining  a precise 
evaluation  of  a given  matrix  structure  or  at  establishing  general 
relationships  for  a class  of  matrix  structures.  In  preparation 
for  the  latter,  the  specific  case  of  a full  matrix  will  be  studied 
to  gain  insight  into  the  analysis  procedure.  The  reader  may  refer 
to  [17]  for  alternate  partitioning  schemes  that  may  result  in 
less  I/O  traffic  for  full  matrices  than  the  one  to  be  studied 
here. 


b.  Full  matrices 


In  Figure  5,  a full  matrix  of  size  n (=6)  is  divided  into  k 

partitions  of  size  S^,  each  having  p = n/k  columns.  The  number 

2 

of  writes  to  backing  store  is  n , assuming  that  the  entire  factored 
matrix  must  reside  on  backing  store.  The  number  of  reads  is 


1st  2nd  kth 
strip  strip  strip 

X X X X X X 
Xs  NX  X X X X 
X Xs  Lx  XXX 


X X X > X X X 


X X I X X Xx  X 
X X I X X,  I x''  ,x 

9x2  + 5x1  = 23  reads 


Figure  5.  Counting  reads  in  a factorization 


N = Z P^P'1^  r + p2 (k- r) 2 


. p (p- 1) (k- 1) (k)  + p^(k-l) (k) (k-%) 


With  total  storage  S and  strip  size  , then  k = S/Sj,  and  p = S^A/5- 
The  second  term  of  (20)  easily  dominates  the  expression,  and  the 


dominant  term  can  be  written 


pV 


If  the  I/O  transfer  time  is  c seconds/byte  and  the  floating 
lint  operation  time  is  d seconds/ (multiply-subtract) , the  ratio 


I 


<S  is 


6 = cNr/dNop  . (22) 

With  typical  values*  c = .23x10  ^ and  d = 2.24x10  ^ and  assuming 
64-bit  words,  for  large  n 

6 = [(8n2)2(.23xl0~6)/ (24  SR) ] / [ 2 . 24x10' 6n3/3] 

= .824n/Sk 

Thus,  when  n/Sk  = 1 (only  one  column  is  in  local  store),  a nearly 

equal  time  is  devoted  to  arithmetic  computation  and  to  I/O.  For 

larger  than  this  absurdly  small  value,  arithmetic  time  predominates. 

- 9 

For  a vector  processor  such  as  the  Cray  1,  where  d = 12.5x10 
- 8 

and  c = 1.56x10  , the  ratio  becomes 

<5  = 10.n/Sk 

and  a local  store  of  10  columns  would  be  adequate  to  keep  the 
processor  supplied  with  operands, 
c . Sparse  matrices 

For  families  of  sparse  matrices  with  a relatively  constant 

column  density,  the  storage  of  L and  U is  typically  S = 0(nlog  n) 

or  even  S = 0(n)  and  the  computation  time  is  typically  0(n*'3),  in 
2 3 

contrast  to  0(n  ) and  0(n  ) respectively  for  a full  matrix.  Thus, 
the  asymptotic  ratio  of  computation  to  storage  is  lower  for  a sparse 
matrix,  hinting  that  the  1/0  transfer  may  become  a more  significant 
cost  factor.  It  will  be  shown  that,  asymptotically  in  n,  1/0  can 
be  a less  significant  factor  for  sparse  matrices;  further,  some 
guidelines  will  be  established  for  the  estimating  the  growth  with  n. 

The  full  matrix  model  of  Figure  5 may  be  generalized  by 
considering  a matrix  of  k partitions,  each  with  Sk  = S/k  storage. 

This  does  not  require  that  each  partition  have  the  same  number  of 


*For  the  Amdahl  470V/6  system  at  the  University  of  Michigan. 
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columns  as  above,  allowing  more  columns  in  the  usually  sparse 
initial  column  strips.  As  shown  in  Figure  b the  L storage  is 
assumed  distributed  throughout  each  strip  so  that 

Sk,L  SL  + r)  sl  i Sk,U  SU  + T = 1’2,,,,k 

o l o 1 

b)  the  non-zero  structure  of  each  column  strip  is  distributed 

so  that  each  strip  must  be  recalled*  for  the  factorization 

of  all  succeeding  strips. 


Figure  6.  Sparse  matrix  partition 

The  total  storage  S is  then  given  by 
k 

S = E (S  + S„  ) + (k-r)S  + (r-k)S„ 
r=l  Lo  U0  1 1 

= k2Sx  + 0 (k)  (23a) 

where  S.  = S.  = S..  . The  recalled  storage  is  then 
1 L1  u1 


*It  will  be  assumed  throughout  this  report  that  only  complete  L strips 
can  be  recalled. 
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the  same  as  (21)  for  a full  matrix. 


To  experimentally  corraborate  this  simplified  expression, 

Table  6 shows  the  total  number  of  reads  of  symbolic  and  numeric 
information  during  the  factorization  of  the  finite  element  matrices 
of  appendix  Table  A4 , together  with  the  value  calculated  from  (24). 
The  error  is  within  10  percent  for  large  matrices,  the  discrepency 
probably  due  to  assumption  that  all  strips  must  be  recalled  in  the 
factorization  of  the  current  strip. 


sk 

S2/3Sk 

Experimental 

.232 

. 13 

.15 

.101 

.31 

.22 

.0355 

. 88 

.83 

.006500  4.8  4.25 

Table  6.  Comparison  of  calculated  and  experimental  recall  reads 
(in  megabytes).  S = 306,000  bytes. 

As  a result  of  (24),  for  a constant  for  S = 0(n  log  n) , 0(n) 

Nr/Nop  = 0(n*5(log  n)2),  0(n'S)  (25) 

indicating  that  relatively  fewer  reads  are  required  for  a sparse 
matrix  than  for  a full  matrix,  for  sufficiently  large  k and  n.  A 
more  qualitative  analysis  will  shortly  show  an  excessive  number  of 
reads  can  occur  for  typical  values  of  n. 


Alternatively,  Equation  (25)  can  be  used  to  estimate  the  growth 
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Figure  7.  Banded  solution 

In  general,  sparse  matrices  have  a structure  that  varies  between 
local  and  complete  strip  connectivity.  Thus,  with  a fixed  local 


One  caveat  must  be  raised.  Sparse  matrices  are  inevitably 
ordered  so  as  to  reduce  fills  occuring  during  solution.  Ordering 
algorithms  can  not  be  expected  to  detect  strip  connectivity,  and 
it  is  not  uncommon  for  strips  to  have  widely-distributed  coupling 
tttjo.ther  strips.  Thus,  a strip  storage  equalization  as  illustrated 
in  Figure  8a  may  not  yield  a minimum  number  of  I/O  operations,  due 
to  scattered  non- zero  positions.  A slight  adjustment  yields  the 
improved  situation  of  Figure  8b. 
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Figure  8.  Reduction  in  reads  by  unequal  strip  storage. 

d.  The  I/O  problem  for  finite  element  solutions 
Although  asymptotically  the  I/O  problem  may  be  less  serious  for 
a sparse  matrix,  the  more  important  issue  is  whether  I/O  will  dom- 
inate for  a particular  sparse  matrix.  An  illustrative  analysis  for 
the  finite  element  family  of  Appendix  A will  show  that  a serious 
problem  can  indeed  exist  even  for  the  computational ly- intensive 
factorization  process. 
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from  the  asymptotic  formula  for  storage  count  in  Table  A1 , 

the  read  time  can  be  counted  as 

1=1.  (S2/3S.) 
r trl  k' 

= T (124n-344)224n/24Sk  (27) 

where  T is  the  transfer  time  of  a 64-bit  word.  The  arithmetic 

operation  time  is 

T = T ( 39 . 5 ( 2-511) ) 
op  arl  v 1 ' 

where  T is  the  time  of  an  arithmatic  operation.  Now  define  the 
ar 

critical  dimension  as  that  value  of  n(  = n ) for  which  T = T ; 

v c’  rw  op’ 

for  n > nc,  Tr  will  exceed  T . 


I/O  Transfer  Time 
(sec . /word) 

Local  Store 
(megawords) 

MFLOPS 

Gr  id 

size 

Storage  (S) 
(megawords) 

Time 

(sec.) 

ASC 

(1 -pipe) 

2xl0'7 

- IQ'6 

1 - 8 

25. 

132  - 

1606 

1.1  - 315. 

7.3  - 

13100 

ASC 

(4-pipe) 

2xl0'7 

- lO’6 

1 - 8 

100. 

60  - 

588 

.172*-  34. 

.16  - 

160. 

Cray 

1 

10 

-7 

1 - 4 

160. 

17S  - 

430 

2.2  - 17.1 

2.6  - 

39.4 

‘Backing  store  not  required 

Table  7.  Grid  size  for  equal  I/O  and  operation  times 
on  current  vector  processors 

Table  7 shows  that  the  critical  dimension  can  be  surprisingly  small 

for  current  vector  processors.  Ranges  for  the  critical  grid  size 
nc 

(^2  ) are  given  for  normal  ranges  of  T - corresponding  to 

commonly  used  disc  capacity  - and  of  S,  . An  interesting  case  is 
the  Cray-1,  where  because  of  extraordinary  processor  speed,  the 
I/O  time  will  exceed  the  arithmetic  operation  time  after  39.4 
seconds,  even  using  full  I/O  channel  capacity  and  addressable 
local  store. 


(40) 


c.  The  I/O  problem  in  the  substitution  process 
The  I/O  problem  is  proportionally  more  severe  for  the  forward 
and  hack  substitution  steps,  where  only  a few  numeric  computations 
are  performed  with  each  element  of  the  recalled  factored  matrix. 

Consider  an  arithmetic  computation  sequence  where  K arithmetic 
computations  are  performed  on  the  average  on  every  L words  in  mem- 
ory, by  a processor  with  an  operation  rate  of  M floating  point 
operat ions/second  (PLOPS) . Then  the  local  store  must  be  supplied 
from  the  backing  store  at  the  rate  of 

ML/K  words/sec.  (28) 

In  the  forward  and  back  substitution  stages,  the  inner  loop 
instruction  will  be  of  the  general  form 

X(I)  = X(I)  - LU(J)*YD  (29) 

where  LU(J)  contains  the  elemtns  of  L and  U.  Each  such  element  is 
used  a single  time  in  the  two  substitutions,  so  that  (ignoring 
array  X and  scalar  YD)  L = 1 and  K = 2 in  (28).  If  the  LU  array 
is  on  a backing  store,  then  this  store  is  required  to  supply 
operands  for  (29)  and  M/2  words/sec. 

This  is  a prohibitive  rate,  as  evidenced  in  Table  8.  Here, 

the  processor  utilization  ratio  given  by 

PUR  = ar:'-t^met^c  operation  time 

I/O  + arithmetic  operation  times 

are  given  for  several  of  the  current  vector  processors  executing 
the  substitution  formula  of  (29).  The  PUR  is  susally  below  .15 
and  becomes  as  low  as  .02. 

It  is  important  to  note  that  this  poor  processor  utilization 
is  dependent  only  on  the  machine  system  parameters , and  not  on 


1 he  ma t r ix  size,  density,  or  equat i on  ordering . Thus,  the  concern 


of  Knight  et  al  [19]  for  banded  matrices  is  shown  to  he  generalized 
to  the  substitution  process  itself. 


Processor 

I/O  Transfer  Time 
( sec . /word) 

Mult . - Sub . Time 
(sec . /op . ) 

PUR 

ASC  ( 1 -pipe) 

2x10' 7 - 10'6 

80  x 10' 9 

.28  - .074 

ASC  (4 -pipe) 

2x  1 0 " 7 - 10'6 

20  x 10'9 

.091  - .019 

Cray  1 

10'7 

12.5  x 10'9 

.11 

Table  8.  Projected  processor  utilization  for 
current  vector  processors. 

To  support  this  claim  of  independence  of  matrix  properties, 

the  experiments  of  Table  3 for  the  Amdahl  470V/6  scalar  processor 

can  be  used.  The  operation  time  of  the  inner  loop  has  been 

determined  in  Table  4 (=2.24xl0'6  sec/operation),  and  the  I/O 

transfer  time  given  in  Table  A4  as  1.83  sec. /word.  The  PUR  is 

then  calculated  as  for  all  matrices  as 

2.24 

PUR  = = .55 

2.24  + 1.83 

Experimentally,  the  same  ratio  can  be  determined  from  the  components 
of  Tj of  Table  A4  for  the  finite  element  family  of  matrices. 


Matrix  PUR 

size 


49 

. 61 

225 

. 58 

961 

. 56 

lable  9.  Illustration  of  constant  processor 
utilization  ratio. 
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The  results  are  given  in  Table  9;  these  show  a PUR  close  to  .55 
for  all  matrices,  and  an  asymptotic  approach  to  this  value  for  the 
larger  matrices.  The  small  descrepency  is  due  to  the  loop  startup 
and  pivoting  times  not  included  in  the  above  model. 

At  present,  there  is  no  known  solution  to  this  problem  in  the 
substitution  process.  Either  one  must  hope  that  the  processor  can 
be  utilized  for  other  tasks  while  the  I/O  is  busy,  or,  in  some 
(few)  applications,  multiple  substitutions  may  be  carried  on 
simultaneously  with  the  same  factorized  matrix. 


Variable  local  store 


3 . 


Local  store  is  commonly  variable  as  a result  of  a multi- 
user environment,  i.e.,  a large  store  is  divided  between  two 
or  more  users.  If  this  large  local  store  is  virtual,  then 
it  will  be  assumed  that  the  user  is  not  assessed  the  costs 
of  paging  or  swapping  from  the  system's  backing  store. 

When  I/O  is  handled  by  the  central  processor,  the  cost 
of  variable  local  store  is  commonly  charged  according  to  the 
length  of  time  and  the  amount  of  store  used.  Specifically, 
the  cost  of  a program  execution  can  be  written 

cost  = a (CPU  time)  + a b (local  storage) (CPU  time) 

= a (1  + b (local  storage) ) (CPU  time)  (30) 

where  a is  a charging  coefficient  (dollars/CPU-second) 

b is  a coefficient  (pages  ^)  that  converts  the  storage 
costs  to  unit  CPU  costs. 

The  total  CPU  time  is  clearly  c N + d N . The  local 
storage  must  have  a value  2 S^,  to  accomodate  both  the  current 
strip  and  the  recalled  strip.  The  cost  can  then  be  written 
as  a function  of  S^  for  a full  matrix  solution  with  large  n as 


cost  . 0(1  * b(Sp.2Sk))(|f-  * dNop  * Tov) 

* SP  ( 3T3n  X + SP 

k v op  ovJ 


- S,  fal  + Sk) ^a2  + Sk^ 


(31) 


I 

4 

* 


where  S is  the  fixed  program  storage  and  Tqv  is  overhead 
time  not  attributable  to  either  numeric  or  I/O  computation. 
Although  this  cost  function  has  a minimum  at  = /a the 

(44) 


minimum  will  in  general  be  quite  broad.  The  sharpest  minimum 
occurs  when  = a2  or  n = 41000  and  = 50  pages,  for  b = .01. 
Even  this  minimum  is  broad;  the  cost  does  not  exceed  double 
the  minimum  value  for  cij/7  < < 60^.  For  more  practical 

values  of  n,  the  local  storage  costs  considerably  exceed  the 
1/0  costs,  and  the  former  can  easily  be  evaluated  from  the 
first  factor  in  (31). 

To  apply  this  analysis  to  a sparse  equation  system  for 
which  (31)  applies,  consider  again  the  finite  element  family 
of  Appendix  A,  solved  on  the  University's  system.  A summary 
of  parameters  and  formulae  pertinent  to  this  calculation  are 
shown  in  Table  10. 


Program 


storage 


(pages) 

Driver  with  arrays 

7.7 

Solver  with  1/0 

7.0 

System  routines 

7.8 

1/0  buffers 

45.0 

Total 

65.  5 

LU  storage  S (Table  Al) 

Tov 

Arithmetic  operations  (N  ) 

1/0  transfer  rate  (1/c) 

Arithmetic  operation  time  in 

inner  loop  (d) 


(124 n -344)2 

0.0  sec. 

39. 5(23n) 

1000  pages/sec. 

1 . 1 x 1 0 ‘ 6 sec/operation 


Table  10.  Parameter  summary 


For  n=5,  a 961  equation  system,  it  may  be  determined  that 
= 82.5,  «2  = 1.09,  yielding  a cost  minimum  at  = 9.5 
pages.  The  estimated  cost  is  plotted  versus  in  Figure  9; 
the  minimum  is  shown  to  be  quite  broad  due  to  the  relatively 
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Cost  per  numeric  solution 
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small  cost  of  local  store  on  this  virtual  memory  system 
(e.g.,  b= . 01  implies  that  when  Sp  + 2Sk  = 100  pages,  storage 
costs  become  equal  to  CPU  numeric  computation  costs).  Several 
correlations  of  estimated  and  actual  solution  costs  has  been 
made.  Figure  9 shows  a rather  large  discrepancy  due  princi- 
pally to  the  use  of  inner  loop  timing  estimates  only,  the 
failure  to  include  substitution  steps  in  the  estimate,  and  the 
assumption  that  T =0.  Note  that  Table  A4  shows  that, 
for  ZS^  = 13k  bytes,  approximately  .57/3.7  = .15  of  the 
factorization  time  is  unaccounted  for.  It  may  be  that  this 
overhead  is  compensated  by  the  use  of  asymptotic  values  for 
S in  Table  10. 


Strip  storage  (S^)  - pages 


Figure  9.  Comparison  of  measured,  calculated  cost  of  virtual 
memory  numerical  solution. 
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0 . Lvaluat ion 

This  chapter  has  been  primarily  concerned  with  modeling 
and  analysis  of  components  of  the  equation  solver.  The  de- 
tailed timing  analysis  performed  on  critical  parts  such  as 
arithmetic  kernels  and  1/0  routines  cannot  reasonably  be 
performed  on  every  code  sequence.  Therefore,  although 
the  asymptotic  performance  for  large  problems  may  be  reliably 
predicted  from  these  key  components,  the  small  problem 
performance  cannot;  indeed,  one  does  not  know  for  certain 
what  qualifies  as  a small  problem  without  analysis  of  all  the 
code  segments.  Fortunately,  since  the  complete  equation 
solvers  have  been  implemented  (Chapter  3),  it  is  possible 
to  make  evaluations  by  comparing  run  time  performance. 

One  high-level  performance  measure  frequently  used  for 
evaluation  of  processors  with  complicated  parallel/pipeline 
architectures  executing  simple  software  kernels  is  the  millions 
of  floating  point  operations  per  second  (MFLOPS) . Plotted 
versus  problem  size,  the  rate  at  which  the  MFLOPS  approach 
asymptotic  values  imposed  by  the  speed  of  processor  arithmetic 
units  gives  a succinct  display  of  overall  small  problem 
performance  meaningful  to  user  and  algorithm  developer 
alike.  We  propose  to  use  the  same  MFLOPS  performance  as  a 
comparative  measure  of  complicated  scientific  packages 
executing  on  a relatively  simple  (scalar)  architecture. 

Figure  10  shows  the  MFLOPS  dependence  of  four  sparse 
equation  solvers  applied  to  the  family  of  finite  element 
problems  of  Appendix  A.  The  asymptotic  values  shown  are 
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obtained  from  the  execution  mode  timings  of  the  innerloop 
obtained  in  Table  4.  The  MFLOPS  are  obtained  from  the 
timings  of  Table  A3  and  a knowledge  of  computational  com- 
plexity. Several  comparisons  of  the  plots  seem  meaningful. 

1.  Over  the  range  of  n shown,  the  scalar  version  shows 
a relatively  superior  performance  over  the  vector  version  for 
intermediate  values  (n=3,4)  where  the  overhead  of  the  inner 
loop  is  significant.  For  n=2,  other  program  overhead  tends  to 
produce  equally  poor  performance  for  both  versions;  for  n=5, 

the  larger  inner  loop  startup  time  of  the  vector  version  becomes 
less  significant  and  the  gap  between  the  two  decreases. 

2.  The  large  program  overhead  of  the  partitioned  version 
makes  this  very  unattractive  for  n=2,3,  but  when  n=5  the  rate 
of  increase  in  MFLOPS  is  large  and  suggests  that  this  version 
should  be  operating  at  501  or  more  of  the  asymptotic  MFLOPS 
value  for  problems  beyond  the  range  of  a 1-2  megabyte  local 
store . 

3.  The  code  generation  version  shows  a remarkably  fast 
climb  to  the  asymptotic  value  (the  vector  inner  product 
execution  timing  of  Table  4 is  used).  Unfortunately,  the 

code  length  did  not  permit  larger  values  of  n to  be  investigated. 


It  is  planned  to  extend  these  revealing  characteristics 


to  larger  values  of  n for  the  partitioned  version,  and  to 


test  this  version  on  commercial  vector  processors.  Perhaps 
more  important,  this  appears  to  be  a useful  measure  of  the 
efficiency  of  general  sparsity  algorithms  versus  more  special- 
ized full-,  band-,  and  block- solvers  operating  from  a backing 
store.  If  the  general  algorithm  can  be  shown  to  be  as  easy 
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to  use  (see  formulat ional  aids,  Chapter  3)  and  near  the 
MFLOPS  rate  of  special  algorithms,  the  usefulness  of  the 
partitioned  general  solver  will  be  well  established.  (48) 


Chanter  3.  DESCRIPTION  OF  A DUAL  VECTORIZED  GENERAL  SPARSE  EOUATION 
SOLVING  PACKAGE 

A . Int  roduct ion 

The  results  of  the  appendix  referenced  throughout  Chapter  2 
were  produced  by  two  general  sparse  equation  solvers.  Together 
with  the  (processor-dependent)  code  generation  program  of  [6]  [8], 

these  Fortran  language  solvers  allow  the  user  to  solve  sparse  sytems 
of  equations  ranging  from  tens  to  tens  of  thousands  of  eqations. 

For  small  systems,  Table  A3  shows  that  code  generation  is  several 
times  faster  than  other  algorithms;  when  memory  size  to  contain 
the  code  becomes  a limitation  - perhaps  for  several  hundred 
equations  - the  matrix  often  has  sufficient  local  density  to 
warrant  a vectorized  solution.  As  the  system  size  increases  to 
several  thousand  equations  the  LU  storage  becomes  excessive  and 
a program  with  backing  store  becomes  necessary.  Since  the 
majority  of  sparse  matrices  solved  on  current  fast  scalar  and 
vector  processors  are  beyond  the  size  manageable  with  code,  the 
latter  two  approaches  appear  the  most  useful. 

The  dual  software  package  described  in  this  chapter  allows 
the  user  to  begin  with  an  equation  solver  contained  in  local 
store  and  then,  as  problem  size  grows,  to  include  a backing  store 
with  little  change  to  the  application  program.  There  is,  of 
course,  some  reprogramming  of  the  I/O  section  of  the  second  solver 
necessary  for  different  file  systems. 

In  developing  the  dual  sparse  solver  package,  several  general 
guidelines  were  used. 

1)  The  package  should  be  able  to  solve  efficiently  sparse 
equations  ranging  in  density  from  several  elements  per  column 


(electrical  power  systems)  to  several  hundred  elements  per  column 
(finite  element  problems).  Since  the  structures  of  such  matrices 
are  described  by  quite  different  data  structures,  this  assumption 
resulted  in  a distinction  being  made  between  the  user-supplied 
matrix  description,  and  the  data  structure  utilized  by  the 
symbolic  and  numeric  solvers.  Thus,  any  of  a variety  of  structural 
descriptions  can  be  entered  by  the  user  (see  flow  chart  of  Figure  15) 
these  are  converted  to  a common  vectorized  data  structure  prior  to 
symbolic  and  numeric  processing.  It  is  recognized  that  an  additional 
storage  will  be  required  to  describe  the  structure  of  block  matrices, 
vis-a-vis  a "hypermatrix"  [14] [15]  [16]  representation.  However, 
as  shown  in  Table  A1 , even  for  relatively  small  sparse  matrices, 
the  symbolic  storage  is  far  less  than  the  numeric  storage  - 
especially  for  machines  with  16-bit  integer  format  - so  that  the 
additional  I/O  incurred  in  recalling  previous  column  strips  con- 
taining both  numeric  and  symbolic  imformation  is  not  significant. 

2)  The  processor  should  have  at  least  1-2  megabytes  of 
local  (real  or  virtual)  store.  Thus,  it  is  assumed  that  the 
symbolic  matrix  structure  fits  in  local  store  and  that  a full 
column  (row)  of  the  numeric  storage  occupies  a smal1  fraction  of 
local  store.  This  assumption  rules  out  a typical  minicomputer 
system. 

3)  The  package  should  run  efficiently  on  both  scalar  and 
current  vector  processors,  although  it  is  recognized  that  per- 
formance could  be  improved  in  some  combinations  of  matrix  structures 
and  vector  architectures  by  major  revision  of  the  symbolic  and 
numeric  algorithms  (e.g.,  blocked  matrices  being  solved  on  a pro- 
cessor with  a high-level  vector  capability,  or  very  sparse  matrices 
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being  solved  by  multi-row  factorization  [131).  Certain  common 
vector  instructions  are  accomodated  (e.g.,  the  inner  product 
instruction)  by  supplying  alternative  forward  and  hack  substitution 
algorithms. 

Tn  this  chapter,  issues  related  to  the  use  of  the  package 
are  examined,  including 

(1)  a simple  example  of  its  use,  without  backing  store  and 
with  user  vectorization  of  the  structural  input  data; 

(2)  discussion  and  examples  of  aids  to  simplify  the  equation 
formulation  and  avoid  the  need  for  vectorization  noted  in  (1) ; 

(3)  general  flow  chart  of  the  program; 

(4)  representative  example  involving  finite  elements  and 
using  backing  store  and  formulation  aids; 

(5)  detailed  discussion  of  algorithms  and  formats  used 
in  the  sparse  solvers. 
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B . General  Program  Description  and  Use 

1.  Program  description 

The  software  package  is  divided  into  three  operationally 
distinct  parts. 

(11  VFGF.S  (VEctorized  General  Equation  Solver),  operating 
without  a backing  store; 

(2)  VF.GES/P,  which  partitions  the  matrix  solution  to  utilize 
backing  store,  and 

(3)  UNBLOK,  a symbolic  preprocessor  that  accepts  a variety 
of  user  descriptions  of  the  matrix  structure  and  produces  vector- 
ized arrays  to  VEGES  or  VEGES/P  to  aid  the  numerical  equation 
formulation  procedure. 

The  specifications  for  VEGES  and  VEGES/P  are  given  in  Table  11. 

It  is  worth  noting  VEGES/P  requires  considerably  more  program 
storage  than  VEGES,  so  that  the  latter  should  be  used  when  part- 
itioning is  not  required. 

2.  Example  use  of  VEGES 

For  the  reader  unfamiliar  with  the  use  of  sparse  equation 
solvers,  an  elementary  example  will  depict  a minimal  programming 
effort  necessary  to  utilize  this  software.  To  distinguish  the 
symbolic  and  numeric  solution  phases,  separate  main  programs 
are  used  for  each  phase;  communication  between  phases  is  provided 
through  a backing  store.  Since  the  symbolic  phase  need  be  ex- 
ecuted only  once  for  a given  matrix  structure,  this  separation 
insures  that  program  and  array  storage  associated  with  only  the 
symbolic  phase  will  not  burden  the  time-consuming  numeric  phase. 
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1.  Name:  VEetorized  General  Equation  Solver  (VEGES  without 


partitioning,  and  VEGES/P  with  partitioning). 


2.  Purpose:  To  perform  direct  solution  of  arbitrarily- 


structured  sparse  simultaneous  linear  equations, 


either  with  or  without  a backing  store  (partition- 


3.  Language:  IBM  extended  Fortran  (see  IBM  document  GC28-6515- 


10);  principal  extensions  from  ANS  Fortran  are 


IMPLICIT,  REAL*8,  and  INTEGER* 2 declarations. 


4.  Operating  system:  Development  and  testing  performed  on 


Michigan  Terminal  System. 


5.  Availability:  Source  language  programs  available  from 


Professor  Calahan  on  9-track,  800/1600/ 


6250  BPI  (1600  default),  IBM  standard  labeled 


(default)  or  unlabeled  magnetic  tape. 


6.  Program  limitations:  up  to  32768  equations;  this  limit  can 


be  changed  by  altering  INTEGER*2 


array  types  to  INTF;GER*4. 


7.  Program  storage  (bytes)  VEGES:  Symbolic  - 5718 


Numeric  - 3976 


VEGES/P:  Symbolic  - 18250 


Numeric  - 15300 


1/0  management  routines  - 12456 


Table  11.  Program  specifications. 


T ^*1 


The  example  solved  by  the  program  of  Table  12  is 


1.  2.  1. 

X1 

1 

CO 

1 

0.  5.  2. 

X2 

n 

16. 

2.  0 5. 

X3 

17. 

_ 

where  [x1  x2  x^]  =[123]  is  to  be  found  using  [1,3), (2, 2), 
(3,1)  as  pivot  positions.  Beginning  with  the  column-ordered 
vectorized  matrix  structure  (recall  Table  2) 

IA:  -1,-3, 1,2, 1,3 
JA:  1,3, 5, 7 

and  the  pivot  positions 

I PC:  3,2,1  IPR:  1,2,3. 

first  the  order  of  IPR  is  inverted  (IPRI (IPR(J) ) = J and  the 
symbolic  processor  VMSP  produces  a set  of  seven  arrays  passed 
to  the  numeric  phase.  In  the  numeric  solution,  these  are 
combined  with  the  column- ordered  matrix  and  right-hand  side 
A:  l.,2.,2.,S.,l.,2.,5. 

B:  8. ,16.  ,17. 

in  VMNP  and  VMBPC  to  produce  the  solution.  The  flow  chart  is 
shown  in  Figure  11. 

It  should  be  noted  that,  although  the  example  depicts 
column-ordered  structure,  row  ordering  can  be  accomodated  by 
using  subroutine  VMBPR  for  the  substitution  steps.  The  svmbolic 
and  numeric  data  would  have  the  row-ordered  form 
IA:  1,3, 2, 3, -1,-3, 

JA:  1,3, 5, 7, 

IPR:  3,2,1,  I PC:  1,2,3, 

A:  1. ,2. ,1. ,5. ,2. ,2. ,5. 
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i * • ♦ * GYM  1*01. 1 C PHASE 

♦ HIMINSIONS:  IVA<  NP  t ) * JU(NPl  > * JL  (NPl  ) » IVUCNPl  ) , IVL  < NPl  ) * IP(N) 

1 \ I «.  N > . 1 VI.  * N ' . i I . n ) . ! I K I ( N)  » I PC(N  ) . IL’NT  < N,  7)  , JA  ( NPl  ) 

DIM!  NS  1 ON  in  1 A NO.  HI  POST  HUNG  IN  MATRIX 
l.  IUMLNMUN  H JU  » JL  - NO.  OF  POSITIONS  IN  U»L 

INI  L lit K **}  JA(  4 >/l  ,3.S»  :v  . IVA(4 ) , JU<  4)  » JL  ( 4 ) , IVU(4)  , IVL<4) 
iNUGLK*.'  IA(*S>/  1,  ',.1,2.1  ,3/,IiJ(9),IL(9),ICNT(3r7)p 
UP( 4) , IXT (4> . IXB(4 ) , IPC( 3)/3,2p 1/t IPR(3>/1 »2»3/, IPRI<3> 

NDSRN- 1 
N=3 

MAXS=? 

MAXN=V 

r***i  INVERT  ROW  PIVOT  VECTOR 
' . *1  I PTVI2(  JPRp  7PRI-N) 

♦ * ' f ‘.'i'll  I SrhlMJi.lC  I 'KllORAM  TO  GENERATE  MAP 

l ' A I I VMSP(N, JA, I A,  IVA. JU , JL , 1U, IL» IVU. IVL, I XT, IXD, IP, 

MEN!  .MmXS,MAXS,MAXN,MAXN,N,  IPCp  IPRI) 

NJ  1 NF1 

CH***  DIVISION  BY  2 FOR  *2  ARRAYS 
I < - ( JU'NFl > + l >/2 
I i ( JL  < Nf - 1 ) i 1 )/2 
r ( jA(NI  1 > fl  ) / 2 

('**»•  Gi  HE.  ARRAY  DIMENSIONS  FOR  NUMERICAL  SOLUTION 

. L IF (6.2) I VJ  ( Ni  t ) . I VL (NPl  ) » JU(NF  1 ) , JL (NPl ) , JA(NP1 ) 

format < U-.I5/  i.-'»I5/'  iu-'-is/'  il-'pIS/'  matrix-' p is> 

r**P*  A VL  SYMDOLiC  RESULTS  ON  DISC  (FOR  ILLUSTRATION  PURPOSES) 

• * * * • • t ■ ' rr  Ai.  t.  •>  v i . Mr;  rw*; 

WK  i 1 1 « NDSRN  , i)  l .4 , 1 4 , lo 
A FORMAT <3 IS) 

l ALL  UR T ( JA , NP l » NDSRN ) 
i.AI  L URT  < IA,  IS.  NDSRN) 
l AL l URT < 1 VA , NPl , NDSRN ) 

FAl  i URT (JU. NPl -NDSRN) 

ALL  URT( JL ,NPi , NDSRN) 

. ,1  URT ( IU  t I 3 -NDSRN ) 

Ci  it  I Wl\  T(IL,I4p  NDSPN  > 

( All  URT ( IVU, N, NDSRN) 
i ALL  WRT( I VL , N , NDSRN ) 
r-Ji  UR 7 ( ItCfN'NUSKN) 

CALL  WkT( IPRI ,N, NDSRN) 

STOP 

END 

51  -'I  A GUT  I NE  URT  ( I , NT  OT  , NDSRN  ) , 

C**»*  .LIN  IS  LENGTH  OF  WRITE  IN  4 -BY  IE  WORDS  1 cl  D 1 C 1Z  . 

INTKE.ER*2  ILEN 

I OUT  CAL  HI  1(1) 

N T 0 T 1 N TOT *4 

N • A ‘I  L’  ~ 1 

1 NT  j 1 2 MINCKNTOT  1 ,52000) 

T Lf  W -NI0T2 

***  ' ! u i,  ;J  T I AL  URTTC.  . . 320C0  BYTES  PER  RECORD 
i .M  E VJ  TE(  1 (ND.VCT  ) , 1 LCNpOp  IDUMM-NDSRN) 

Nl.ill  NT0T1  T2V00 
. MHASE  132000 

II  ( N 1 1)1 2 • €() . 32000)  GO  TO  1 
RL TURN 

ENU 

*•<♦♦  ..'Ufiif  lC  PHASE 

♦ i*t  . i f oil  MU  U,i  . IIJ,  fl  . I A PER  PREPROCESSOR  PRINTOUT 

i.l  ,f  ♦:  A*  n/\  . DO  • 2 . DO  , 2 • DO  t S • VO  t 1 . VO,  2.  DO,  5 , VO/  * U<  4 ) rL  ( A ) r 
r.iKJ'-IK  .”  ST . !"'•»  IA.  DO*  1 ? . DO/  , X ( 3 ) 

INTEGER *4  JA ( 4 ) p I VA ( 4 ) , JU < 4 ) , JL < 4 ) , I VU ( 4 ) , I VL < 4 ) 

INTEGER *2  IA(7)»IU(4),IL(4). I PC  < 3 ) » IPRI (3) 

NDSRN=1 

N-3 

NPl =N+ 1 

READ  DATA  FROM  PREPROCESSOR 
READ (NDSRN, 1 ) 13, 14, 15 

1 FORMAT C 315) 

CAI  l RDT ( JA, NPl .NDSRN) 

FALL  EDf CIA, IS » NDSRN) 

F ALL  RDC<  [VA, NT  1 -NDSRN) 

CALL  RDE( JU.NFt pNDSRN) 

CALL  K DE ( JI  , NPl, NDSRN) 

CALL  T TC ( IUp I3-NUSRN) 

. ; R [.:-••  r:  . v«;.ur,=FN) 

CAli  ‘ IVLI-N, NDSRN) 

CALL  RDE ( I VL  » N . NDSRN ) 

CALI  RDET IF C.N, NDSRN) 

LALL  RDE ( IPRI ,N, NDSRN) 

CALL  VMNP(N,  JA,  I A , I VA  , A p JU » JL  , IU  r II. , D I . U , L • X , I VU , IVL. 

SI  PC, IPRI ) 

CALL  VMBFC ( N , JU  p JL  p I U , I L p I VU  * I VL , D 7 * U , L r B p X , I PC » 1 PR  I ) 

UP  I T E ( 6 » 2 ) ( B < J ) , J - 1 » N ) 

2 FORMAT (3E1 5. /) 

STOP 

END 

SUBROUTINE  RDF ( I , ILFN , NDSRN ) 

C**»*  ILF  / rS  LENGTH  Or  READ  IN  4 -BYTE  WORDS 
INirr.ER*?  I XCN 
LOG ICAL*1  1(1) 

NR l I LEN ♦ 4 
N7fAS£  = l 

1 t1  g2  MIN0(N01 ,32000) 

IXEH-N82 

«**«  . , M T f AL  READ 

C I.  r r AB  ' I (NBASE  ) . IXLNpO*  1DUMM.  NDSRN) 

Ni-1  NO  1 - (2000 

nT a i NBASE  F 32000 

;r  (fiu.'.F  0. 32000)  GO  TO  1 


Fortran  program  to 
solve  Equation  32 
using  VEGES. 
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I A , JA  (vectorized) 
IPRI , IPC 


\ 

r~ 

| VMSP 

' 

IVA 

IU, JU, IVU 
IL.JL.IVL 


B 


A matrix  structure 
Pivot  order 


Symbolic  phase 


L and  U vectorized 
structure,  written 
to  backing  store 


Numeric  factorization 


Factored  matrix,  diagonal 


Forward  and  back 
substitution 


Solution  vector 


Figure  11. 


Flow  chart  of  non  partitioned  vectorized  sparse  matrix 
solver . 
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C.  Simplified  Equation  Formulation 


\ 

i 

>» 


1.  Introduction 

From  a formal  viewpoint,  a general  sparse  equation  solver 
can  reasonably  require  that  the  matrix  be  stored  in  standard 
order  (e.g.,  row  or  column)  in  local  or  backing  store  before 
the  solution  is  initiated.  The  burden  of  formulating  the 
equations  in  this  order  is  left  with  the  user,  so  this  rationale 
goes,  because  the  variety  of  equation- formulation  procedures 
is  simply  too  large  to  accomodate  in  a general  way.  Unfortunately, 
this  standard-ordering  requirement,  together  with  the  unavoidable 
necessity  of  describing  the  sparsity  structure,  has  made  the 
conversion  from  full-  and  band- or iented  methods  to  general- 
sparsity  methods  worthwhile  only  for  large  problem-oriented 
analysis/design  packages  where  the  user  is  isolated  from 
the  demands  of  the  sparse  equation  solver. 

Several  aids  to  the  equation  formulation  have  been  produced 
to  avoid  this  column- ordering  requirement  on  the  user.  Their 
use  is  illustrated  in  the  following  examples.  In  this  development, 
a distinction  will  be  made  between  a scalar  structure,  where 
individual  elements  of  the  matrix  are  described  by  their  (row, 
column)  position,  and  a vector  structure,  where  submatrices 
(blocks)  are  indicated  by  several  descriptors. 

2.  Scalar  case:  example 

To  exemplify  a typical  conversion  from  a full-  (or  band-) 
matrix  solution  to  a general  sparsity  solution  using  the  formu- 
lation aids,  consider  the  elementary  example  of  Figure  12 
(the  reader  is  assumed  to  be  familiar  with  the  example  of  Table 
2).  The  key  steps  in  this  process  are  the  following. 
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numeric  factorization  portion  of  VF.UF.S. 


formulation  and  symbolic  processing. 


1)  The  user  must  supply  explicit  structural  data,  hut 
the  structure  may  be  in  any  order  and  in  a variety  of  forms; 
the  scalar  ( i , j ) form  is  illustrated  in  figure  12.  This 
structure  is  most  easily  produced  by  creating  a new  version 
of  the  full  matrix  formulation,  but  replacing  each  numerical 
formulation  step  with  a symbolic  formulation  step  (a  four- 
step  formulation  is  shown);  this  symbolic  data  (array  NA)  is 
preprocessed  in  the  symbolic  phase  of  figure  12. 

2)  because  the  sparse  solver  requires  column  (row) 
ordering  of  numerical  data,  but  the  equations  in  general  may 
be  formulated  in  any  order,  a mapping  array  MAP  is  generated 
by  subroutine  UNBLOK,  with  an  element  for  each  numeric 
formulation  step,  so  that  at  the  ith  formulation  step, 

A(MAP(1))  is  calculated  by  the  user.  Alternatively,  if  the 
numeric  values  are  stored  in  an  array  B (viz,  B(l)  = 3., 

B ( 2 ) = 7.,  etc),  the  A matrix  array  is  formed  by 

DO  1 J= 1 , 4 

1 A (MAP ( J) ) = A (MAP (J))+B(J)  (31a) 

Note  that  the  mapping  array  increases  the  storage  between 
25%  and  100%,  depending  on  the  word  sizes  of  symbolic  and 
numeric  data. 

3)  Duplicate  positions  are  often  created  in  the  determina- 
tion of  array  NA,  as  when  several  numeric  values  must  be  added 
to  produce  a single  matrix  position.  This  duplication  results 
from  the  creation  of  a matrix  position  associated  with  several 
physical  components.  As  illustrated  in  Figure  12,  this  sit- 
uation is  readily  handled  with  the  MAP  array,  where  MAP(2)  = 

MAP (4)  = 3. 
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3.  Scalar  case:  algorithm 

In  addition  to  the  storage  introduced  by  the  MAP  array, 
another  critical  question  is  the  computational  complexity 
of  the  MAP  generation.  This  proceeds  immediately  from  a 
study  of  the  algorithm. 

Let  a user- suppl ied  n x n matrix  structural  and  numerical 


description  be  described  by 


(row, column)  position  = C i ^ » j ^ ) 


k=l , 2 , . . . m 


a . , - 

Jk 


To  arrange  in  column  order,  define 
= ik  + (n+l)(jk-l) 

where  duplicate  values  of  are  allowed.  If  the  values  of 
are  sorted  in  ascending  order  and  a permutation  vector  p^ 


maintained  so  that 


„ JL 

Pr  * Pk 


r = 1,2,3,. . . k- 1 
k = 2 , . . . m 


then  the  sequence 


b , b , . . .b 

Pi  P2  Pm 

is  a column-ordered  list  of  numerical  values,  with  possible 
duplication  of  positions  being  adjacent  in  this  list. 

Now  assume  that  £ = i implies  that  b and  b 


are  to  be  added  to  compose  a sparse  matrix  position.  Define 


the  set 


(q)  = (r  : 1 < r < m,  l ? i 


Here,  q^,  k=l,2,...s,  points  to  a set  of  unduplicated  values 


■» 


, 

u 


ji 

’■€ 


of  K . The  column-ordered  matrix  positions 

Pr 

without  duplication,  by 

5V  ■ r*„  /("*oi  * i 

p'«k 


are  then  given, 


i , = X. 
k p 


- (n+1) ( j k* 1) 

k = l , 2 , . . . s 

MAP  is  generated 

by  defining 

w = 1 

Pi 

w = w 

pr  pr-l 

X = 

Pr  Pr-l 

r=2 , . . .m 

= w +1 

Pr-l 

Pr  Pr-l 

(331 

Then  the  packed  sparse  array  A is  calculated  columnwise  from 

k=l  , 2 , . . .m 


c c + b, 

wk  Wk  k 


Thus  the  set  performs  the  function  of  the  MAP  array. 

The  complexity  of  the  above  computations  is 

(1)  0(mlog2m)  for  performing  the  sort,  and 

(21  0(m)  for  performing  the  scans  of  (321  and  (331. 

The  sort  easily  dominates  the  computation.  Since  m=0(nl, 
for  finite  element  grids,  this  complexity  is  Ofnlopnl. 

4.  Vector  case:  introduction 

Large  sparse  equations  are  usually  most  easily  formulated 
in  blocks,  each  relating  clusters  of  equations  and  variables. 
These  blocked  submatrices  must  then  be  arranged  in  column 
order  within  the  block,  and  then  these  columnwise  representations 
inserted  into  the  overall  column- ordered  matrix  structure. 

This  unblocking  process  - the  primary  function  of  subroutine 
UN BLOK  - is  complicated  by  the  following  factors. 


(621 


T i mes 


files  of  Backing  Store 


1)  Blocks  may  be  generated  in  any  order. 

2)  Blocks  may  in  general  overlap  in  either  or  both 
dimensions  (however,  see  [20]  for  methods  of  eliminating 
overlap  at  a cost  in  matrix  size). 

3)  Blocks  may  have  an  internal  sparsity  structure  worth 
preserving,  e . g ., diagonal , tridiagonal,  etc. 

4)  For  very  large  matrices,  the  blocks  may  not  be  con- 
tainable in  local  store,  but  must  occasionally  be  written  to 
backing  store. 

To  reduce  user  concern  for  these  issues,  the  preprocessing 
subroutine  UNBLOK  accepts  a high  level  description  of  the 
block  structure,  and,  similar  to  the  scalar  case  of  Figure  12, 
produces  both  (a)  a vectorized  matrix  symbolic  description  for 
VMSP,  and  (b)  a vectorized  mapping  array  to  assist  equation 
formulation. 

Since  the  primary  use  of  the  blocking  feature  is  expected 
to  be  in  the  solution  of  finite  element  problems,  a simplified 
format  has  been  provided  in  this  case.  Entering  only  the 
node  numbers  of  the  element  lesults  in  the  necessary  vectorized 
matrix  description  and  formulation  map,  assuming  all  nodes  in 
the  finite  element  are  coupled  in  the  element  matrix  description. 
Boundary  conditions  are  also  handled  at  this  level. 

A summary  of  scalar  and  vector  formats  acceptable  to 
UNBLOK  is  given  in  Table  13. 

Unpartitioned  form 

Temporarily  ignoring  the  problem  of  incorporating  a backing 
store  when  the  matrix  cannot  reside  in  local  store,  two  examples 
will  be  studied  to  illustrate  the  correspondence  between 

symbolic  and  numeric  arrays  and  their  use. 
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jffgwr 


The  first  example  of  Figure  13  illustrates  the  processing 
of  a variety  of  overlapping  block  structures.  The  user 
supplies  a symbolic  array  NA  to  the  symbolic  preprocessor 
UNBLOK  and  a numeric  array  to  the  equation  solver.  The 
latter  array  must  be  column  ordered  by  the  user  within  each 
block,  regardless  of  the  block  structure.  Thus,  a finite 
element  must  be  formed  in  column  order,  usually  a minor  restriction. 


Structural  Description 


Format 


Scalar  in  (i,j)  position 


Dense  vector  in  column  j 

from  row  i^  through  row  i^ 


Finite  element  connecting 
nodes  k^ , • • • • 

and  boundary  conditions 

i.  l7  . . .1 

l,z,  m. 


Dense  (q  = 1 ) I 0 

l block  between  Q 

Diagonal  (q=2)  > positions  (i1,j1>)  i 

Tridiagonal  (q  = 3)J  ant* 


(able  13.  Format  of  NA  array  input  to  UNBLOK, 

assuming  column-ordering  (using  VMBPC, 
ZMBPC) . 


The  formulation  process  may  be  vectorized  with  a potential 
savings  in  storage.  Rather  than  generating  a single 
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array  MAP,  one  car.  produce  both  (a)  MAP(I)  which  points  to 
the  beginning  of  a column- ordered  vector  in  the  packed  matrix 
array  A,  and  (b)  LEN(T)  which  gives  the  length  of  this  vector. 
The  A array  is  then  loaded  from  the  B array  in  this  manner. 

N=0 

DO  1 J= 1 , KT 
IV1=MAP ( J) 

IV2=LEN(J)+IV1-1 
DO  1 I=IV1 , IV2 
N=N+1 

1 A(I)=A(I)+B(N)  (341 

As  in  the  scalar  case,  (Eq.  (31a)),  the  array  B need  not  be 
formed  in  its  entirety  before  transfer  to  A.  I-'or  example, 

B need  store  only  the  numerical  components  of  a single  block 
or  finite  element;  so  long  as  the  J and  N counters  of  (34)  are 
maintained,  the  innerloop  can  be  entered  at  any  time. 

The  choice  of  using  scalar  or  vector  representation  of 
(31a)  or  (34)  depends  on  the  average  vector  length  (L  ) of 
the  components  of  the  3 array.  Since  this  Lave  will  be  less 
than  the  L for  the  completed  matrix  which  in  rn  will  be 
less  than  the  Lav0  of  triangular  factors  L and  U due  to  fill, 
the  scalar  mapping  procedure  of  (31a)  is  expected  to  be  pre- 
ferred in  the  majority  of  cases. 

The  second  example  of  Figure  14  illustrates  (a)  the  sim- 
plicity of  using  the  high  level  finite  element  feature,  and 
(b)  a simplified  method  of  incorporating  boundary  conditions. 

In  this  example,  all  elements  are  assumed  triangular  with  two 
unknowns/node.  The  related  matrix  structure  is  shown  by  x's, 
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Figure  13.  Example  of  column-ordered  matrix  preprocessing  of  structural 
data . 


(66) 


separator 


for 


three  partitions 


(a)  Triangular  finite  element  structure 


x x 
X x 

X X 
X X 
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X X 
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X X 
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X X 


X X 
X X 
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(b)  Matrix  and  NA  array  without  boundary 
condi t ions 
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(c)  Partitioned  matrix  and  NA 
array  with  boundary 
conditions  on  variables 
2,  7,  9. 


Figure  14. 


Example  of  row-ordered  partitioned  finite  clement 
problem  and  matrix. 
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and  the  NA  array  is  also  given.  If  the  boundary  conditions 
are  of  the  Dirichlet  tvne  on  variables  2,7,  and  9,  these  num- 
bers can  simply  be  appended  to  the  NA  array  as  shown.  If  row- 
ordering is  used  (VMBPR) , subroutine  UNBLOK  will  then  generate 
appropriate  vector  descriptors  to  numerically  zero  elements  in 
the  rows  associated  witli  the  boundary  condition  variables. 

A unit-valued  diagonal  will  then  force  the  variable  to  the 
right  hand  side  value.  An  example  will  be  given  shortly. 
Partitioned  form 

When  the  A array  cannot  be  kept  in  local  store,  it  must 
be  formed  in  partitions.  If  the  matrix  is  row-ordered  as 
required  for  incorporation  of  boundary  conditions,  the 
partitions  must  also  be  along  row  boundaries,  as  illustrated 
by  the  three  partitions  (P1-P2-P3)  of  Figure  14. 

Each  partition  is  formed  in  a buffer  region  of  local  store 
when  a partition  is  completely  formed,  it  is  written  out  to 
backing  store,  and  the  buffer  region  is  overwritten  by  the 
next  partition.  Continuing  with  the  example,  assume  that 
each  finite  element  is  formulated  before  beginning  the  next. 
Then  the  coupling  of  elements  sharing  nodes  within  the  element 
array  requires  that  at  least  one  partition  be  resident  in 
local  store,  representing  the  coupling  of  past  and  future 
partitions.  Thus,  PI  representing  node  1 of  Figure  14  can- 
not be  written  to  backing  store  until  finite  elements  a and  b 
have  been  formed.  Then  the  separator  2-3-4  isolates  succeeding 
elements  from  node  1,  so  that  PI  can  be  written  to  free  space 
for  P3 . 

A subtlety  in  this  buffering  scheme  is  introduced  by 
distinguishing  the  extent  of  concurrency  bv  the  processor  in 


the  formulation  and  I/O  processing.  If  the  central  processor 
performs  the  I/O  as  well  as  the  numerical  equation  formulation, 
then  only  two  buffer  regions  need  he  defined.  for  example, 
in  the  5-partition  problem  of  Figure  19,  to  be  discussed  in 
detail  later,  the  formula! ion- I /0  sequence  with  two  buffers 
would  proceed  as  follows. 


Set  up  elements  Uj  - ag  in  buffers  1 and  2 
Write  buffer  1 (cols.  1-10) 

Set  up  elements  b1  - bg  in  buffers  2 and  1 
Write  buffer  2 (cols.  11-20) 

Set  up  elements  Cj  - cg  in  buffers  1 and  2 
Write  buffer  1 (cols.  21-30) 

Set  up  elements  dx  - dg  in  buffers  2 and  1 
Write  buffers  2 and  1 (col.  31-46) 


On  the  ot^cr  hand,  if  the  I/O  is  handled  by  a separate  pro- 
cessor and  a buffer  region  cannot  be  simultaneously  filled 
and  drained,  three  buffers  must  be  used.  The  example  would 
now  proceed  as  follows. 
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5.  Vector  Case:  algorithm 

The  subroutine  UNBLOK  operates  on  user- suppl ied  block 
descriptions  as  follows.  Since  the  blocks  produced  in  the 
vector  case  have  an  internal  sparsity  structure,  the  first 
step  in  column-ordering  the  entire  matrix  is  to  column-order 
This  is  carried  out  "on  the  fly",  i.e.,  as 


each  block. 


each  block  is  recognized  in  the  user  description,  an  expanded 
vector  description  - beginning  and  ending  row  numbers  of  each 
dense  column  segment  - is  generated  in  column  order  for  the 
block.  In  the  case  of  a finite  element,  the  entire  submatrix 
is  column- ordered , irrespective  of  the  block  structure. 

The  arrangement  of  these  vectors  into  a column-order  for 
the  entire  matrix  has  two  steps. 

(1)  An  array  of  beginning  row  numbers  for  each  vector 

is  saved  from  the  above  process.  Together  with  the  correspond- 
ing column  number,  these  "scalar"  descriptions  of  each  vector 
starting  position  are  sorted  as  described  in  the  scalar  case. 

(2)  The  column-ordered  list  of  starting  vector  positions 
is  scanned  a column  at  a time.  For  each  co  :mn,  starting  and 
ending  row  positions  of  vectors  are  noted,  and  overlapping 
vectors  are  combined  to  form  the  final  compacted  vectorized 
matrix  structure. 

The  sorting  is  again  the  dominant  computational  effort. 

If  there  are  m^  blocks  with  an  average  of  m,  col umns/block , 

a*' 

then  the  complexity  is  Ofm^m, log 7 (m jm^l ) . 

D.  Program  Flow  Charts 

Having  illustrated  the  use  of  the  unpartitioned  solver 
VEGES  and  the  symbolic  preprocessor  UNBLOK,  we  may  now  view 
the  flow  chart  of  the  complete  svstem  package  with  some  under- 
standing before  considering  in  detail  the  more  complicated 
partitioned  version  in  VEGES/P.  Particularly  noteworthy  in 
Figure  15  is  the  IBM  360/370  assembly  language  version  of  the 
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2).  The  key  steps  in  this  process  are  the  following. 
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numeric  factor  i :at  ion  portion  of  VHGHS. 

E . Details  of  the  Partitioned  Solution 

1 . Int roduct ion 

The  implementation  of  a backing  store  version  of  a general 
equation  solver  necessitates  a fundamental  assumption  concern- 
ing the  size  of  local  store  versus  the  size  of  the  system  of 
equations.  In  contrast  to  specialized  full-,  band-,  and 
block-solvers  where  the  structure  is  given,  sparse  solvers 
must  have  ready  access  to  both  numerical  and  structural  data; 
this  implies  that  a decision  must  be  made  on  whether  to  main- 
tain either  or  both  completely  in  local  store  or  whether  only 
a local  description  - adequate  for  a local  computational 
sequence  - should  be  maintained,  the  rest  residing  on  backing 
store.  A similar  consideration  is  involved  in  the  equation 
formulation  stage,  i.e.,  whether  the  matrix  (and  the  MAP  array 
if  UNBLOK  is  used)  is  formulated  and  written  to  backing  store 
in  parts.  In  general,  a tradeoff  must  be  made  between  flex- 
ibility in  use  when  all  potentially  large  arrays  are  partition- 
able,  and  user  convenience  wh  n ill  arrays  are  resident 
in  local  store. 

In  VEGES/P,  it  is  assumed  that  arrays  associated  with  the 
symbolic  solution  phase  are  in  local  store  but  all  other  large 
arrays  are  in  backing  -tore.  Table  14  details  this  assumption 
by  giving  the  residency  status  of  A,  L,  U,  etc.  in  the  form- 
ulation and  solution  stages.  The  justification  for  this  choice 
is  that  the  vectorized  structural  arrays  arc  likely  to  require 


(72) 


(59) 


at  least  an  order  of  magnitude  less  storage  than  numeric  arrays 


1 array 
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U 

structural 

L 
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B 
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(a)  Formulation 


*L  during  symbolic  phase 
(b)  Solution 


able  14.  location  of  arrays  in  UNBLOK  and  VEGF.S/P;  I. 
R - backing  store. 


local  store 


for  problems  necessitating  a backing  store  (the  ratio  being 
4n  for  L and  U in  the  finite  element  problem  class  of  Appendix 
table  A1 ) . 

2.  flow  chart  of  UNBLOK  symbolic  preprocessor 

The  UNBLOK  subroutine  originally  cited  in  Figure  12,  has 
the  ability  to  partition  the  MAP  and  LEN  arrays,  as  indicated 
in  Table  14,  writing  these  to  a backing  store  one  partition  at 
a time.  Thus,  these  formulational  arrays  do  not  seriously  affect 
the  storage  requirements  for  the  symbolic  phase  of  the  solution. 

Other  noteworthy  features  of  UNBLOK  are  the  ability  (a)  to 
produce  either  scalar  or  vector  mapping  arrays  (see  Eqs.  (31a) 
and  (34)),  and  (b)  to  identify  rows  which  are  to  be  zeroed  in 
the  manner  of  Figure  14c,  to  handle  boundary  conditions. 

3.  Flow  chart  of  ABLOK  numerical  preprocessor 

The  numeric  formulation  phase  is  complicated  somewhat 
by  partitioning  of  the  formulation  step,  since  the  MAP  and 
LEN  arrays  are  on  backing  store  (from  UNBLOK  above);  then 
retrieval  must  be  coordinated  with  the  formulation  of  components 
of  A. 
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Figure  16.  Flow  chart  of  UNBLOK  subroutine 
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Figure  18.  Flow  chart  of  partitioned  solution. 


This  coordination  can  be  generalized  provided  that  the 


components  of  A are  also  on  a backing  store  in  the  order 
consistant  with  MAP  and  LEN.  This  is  performed  by  subroutine 
ABLOK  with  flow  chart  given  in  Figure  17.  Note  that  use  of 
this  routine  precludes  formulation  of  the  matrix  "on-the-fly" 
from  its  components,  and  requires  an  additional  read/write 
cycle  in  the  equation  formulation.  This  price  for  generality 
is  expected  to  be  small  for  fast  scalar  processors  but  could 
be  significant  for  vector  processors.  At  a minimum,  ABLOK 
provides  an  example  of  the  coordination  process  for  the  user 
with  an  alternate  strategy. 

4.  Flow  chart  of  symbolic,  numeric  solution 

The  symbolic  and  numeric  formulational  preprocessors  leave 
the  matrix  structure  and  values  on  a backing  store,  respectively. 
The  symbolic  and  numeric  solution  then  proceeds  in  five  major 
steps . 

(1)  The  columns  which  initiate  the  partitions  are  selected. 
These  column  breaks  are  determined  in  an  addition  to  the  symbolic 
processing  phase  by  user  specification  of  either  (a)  column 
numbers,  thus  accepting  whatever  partition  storage  requirements 
that  result,  or  (b)  maximum  strip  storage  size,  permitting  the 
column  breaks  to  be  selected  by  an  internal  algorithm.  This 
interactive  partitioning  step  is  depicted  in  the  block  diagram 

of  Figure  18  and  will  be  illustrated  shortly. 

(2)  The  matrix  A is  read,  reordered,  and  written  to  back- 
ing store  with  zero-valued  positions  of  L and  U inserted. 

( 3)  The  L and  U are  formed  in  column  - ordered  strips,  each 
strip  being  written  to  backing  store  on  completion  and  recalled 
when  necessary  to  form  another  column  strip. 


J 
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JS2H0T 


(4)  The  strips  of  first  1.  ami  then  l)  are  recalled  in  sequence 
to  carry  out  the  forward  and  hack  substitution  steps.  These 
steps  (F-H)  are  shown  in  the  flow  chart  of  Figure  16,  together 
with  the  specific  reads  and  writes  to  backing  store  of  both 
numeric  and  symbolic  information  (the  times  given  are  referenced 
in  Appendix  table  A4) . 


5.  Example  of  partitioned  solution  of  finite  element  problem 
The  finite  element  grid  of  Figure  19  presents  a sufficiently 


complicated  problem  to  illustrate  the  major  features  of  VEGES/P 
and  the  relative  simplicity  of  solving  this  large  class  of 
application  problems. 


Separator 


6-8-10  11-13-15  16-18-20 


Initial  column 
(2  var./node) 


Figure  19.  Finite  element  array  and  partitions  for  UNBLOK. 
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The  symbolic  data  presented  to  UNBLOK  consists  of 
principally  the  finite  element  node  numbers,  the  boundary 
variable  numbers,  and  partitioning  information  for  both  the 
variable  numbers  and  the  finite  element  node  number  list.  The 
precise  data  requirement  is  shown  in  Table  IS.  Note  that  these 

N,  the  size  of  the  matrix 
IVEC  - 0 for  scalar  formulation 
- 1 for  vector  formulation 
NUMBUF,  the  number  of  buffers 
NPART,  the  number  of  partitions 
NBC,  the  number  of  boundary  conditions 
NVAR,  the  number  of  variables/node 
NNODE,  the  number  of  nodes/element 
both  in  COMMON/FEL/ 

NA(NNODE*NVAR* (no.  of  elements) + NBC) , 

list  of  element  node  numbers  and 
boundary  conditions 

IPART (NPART+ 1) , beginning  column  (row) 
numbers  of  each  partition 
IXP (NPART+ 1 ) , pointer  into  NA  indicating 
beginning  of  new  partition 

Table  15.  List  of  input  data  to  UNBLOK. 

partitions  pertain  only  to  the  equation  formulation  step.  In 
this  example,  this  data  is  furnished  by  DATA  statements. 

As  the  flow  chart  for  the  example  shows  (Figure  20) , the 
symbolic  phase  can  be  executed  through  ZMSP  (the  preparation 
of  the  LIJ  map)  with  this  minimal  information.  The  partitioning 
of  the  matrix  solution  itself  is  next  carried  out  interactively. 
The  interaction  is  illustrated  in  Figure  21,  where  two  partitions 
are  examined  - one  based  on  a maximum  buffer  size,  and  one  on 
specific  column  breaks  (which  happens  to  be  identical  to  the 
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column  breaks  specified  in  the  formulation  stage  in  this  ease). 
The  choice  of  a partition  results  in  the  writing  of  the  L and 
U maps  as  well  as  the  presentation  of  certain  critical  dimen- 
sioning information  for  the  numeric  solution  phase. 

The  finite  element  is  evaluated  and  written  to  backing 
store  in  the  same  order  as  presented  in  array  NA.  These 
are  retrieved  together  with  the  MAP  and  LEN  arrays  in  ABLOK, 
which  then  writes  the  partitioned  A matrix  to  a backing  store. 

The  remainder  of  the  solution  follows  in  the  manner  of  Figure  18. 

The  main  programs  for  the  symbolic  and  numeric  solution 
phases  are  given  in  Appendix  C.  The  reader  will  note  that  in 
the  numeric  phase  the  user  need  supply  only  the  subroutine 
NFINI  for  evaluating  the  finite  element  matrix  and  the  right 
hand  side  vector. 


User  definitions  of 
finite  i lenen t problei 
ee  Tab!  LS) ; 
order  II' > . I i'K  ■ ■ ; i i K 1 


Call  UN BLOK  to  get  I A,  JA 
(symbolic  A matrix  description) 
and  map  of  finite  elements, 
blocks,  vectors  and  oculars 
into  numeric  A matrix  buffers 
corresponding  to  specified 
partitioning  on  logical  I/O 
unit  IMUN1T. 


Print  A buffer  dimension 


Call  ZMSP  to  perfoi  lie! 
LU  factorization,  return  l.U  I 
structure. 


Call  ZBRFAK  for  interactive 
partitioning  of  A and  LU 
symbolic  information 


Call  ZMSPO  to  complete  par- 
titioning and  write  out  nfor- 
mation  on  units  IAUNIT  and 
l NUN IT.  


(a)  Symbolic  Main  Program  includes  all  dimensioning  for  calls 
to  UNBLOK,  ZMSP,  ZBRF.AK  and  ZMSPO. 


Call  NFINI  to  write  numeric 
finite  elements  on  unit  NBUNIT , 
replacing  the  numeric  formulation 
stage  in  this  example. 


Call  ABI.OK  to  formulate  A 
matrix  from  finite  elements. 
Symbolic  mapping  is  from 
IMUNIT,  A numeric  matrix  is 
written  one  row  at  a time  on 
N'AUNIT.  Pile  pointers  to  each 
row  and  the  right  hand  side 
arc  written  on  IBUNIT. 


Call  ZMNP  to  perform  numeric  LU 
factorization,  reading  IAUNIT, 
N'AUNIT,  IBUNIT  and  INUNIT  and 
writing  I.  and  U partitions  on 
1 1- UNIT  and  IUUNIT,  respectively. 


(all  ZMBPR  to  perform  forward 
and  back  substitution  using 
II-UNIT,  IUUNIT  and  IBUNIT  and 
returning  the  solution  vector. 


Print  solution  vector  I 


(o)  Numeric  Main  Program  includes  all  dimensioning  for  call-, 
to  NFINI.  ABI.OK,  ZMNP  and  ZMBPR. 


Flow  charts  of  example  finite  element  solution  program 
of  Appendix  C. 
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Value  returned  from  UNB1.0K  for  A numeric  buffer  for  ABLOK. 

Symbolic  partitioning  routine  ZBREAK  permits  examination  of 
part 1 1 ion i hr  strategies  and  their  effect1!  on  numeric  buffer 
storage  and  1/0  requirements. 

Partitioning  cun  be  specified  by  either  ot  two  methods: 

(1)  by  entering  a maximum  array  storage  figure  which 
determines  column  breaks  by  limiting  the  total 
buffer  storage  to  the  given  number  (this  does  not 
include  other  required  arrays  of  length  N). 

(2)  by  entering  specific  column  breaks. 

The  choice  is  controlled  by  end-of-file  entry  by  user 

L STRT*  gives  number  of  first  L block  needed  in  LI’ 

Tac  t’or  Fzat  ion  of  this  block. 

MIN  I.  SIZE,  is  the  size  in  "ILENSY  units”  (IBM  360.370. 
AMDAHL  JToV/6:  halfwords)  of  L buffer  containing 

symbolic  and  numeric  information.  Since  previous  L 
blocks  are  needed  for  factorization  this  number  is 
critical  to  the  amount  of  I/O  done  in  the  numeric 
rout i nes . 

MIN  IXBUFF  is  the  size  in  ILENSY  units  of  the  major 
symbolic  and  numeric  buffer  required  for  this  partition. 

BUFFER  SIZES  are  presented  for  dimensioning  purposes  in 
the  numeric  main  program;  I/O  gives  a count  of  all  read 
and  write  operations. 
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•ZBREAK  headers  consider  the  matrix  to  be  column  ordered; 
for  row  ordering  exchange  "IT  for  "L”  and  "row"  for  "column. 


Equi valencing  is  suggested  on  the  basis  that  the  A and 
IBBUFF  arrays  must  be  kept  separate  but  can  overlap  IXBUFF. 
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Figure  21.  Run  of  symbolic  and  numeric  program  of  Annendix  C. 
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F . Fortran  implcmetation  (VEGES) 

1.  Symbolic  processing 

A Fortran  implementation  of  these  vector  methods  can  be  viewed 
in  three  distinct  steps  (see  Fig. 11).  Given  the  symbolic  map  of 
A in  IA  and  JA,  the  symbolic  processing  routine  VMSP  identifies 
all  fill  positions  and  creates  maps  of  the  L and  U matrix  factors. 
Array  results  IU  and  JU  from  VMSP  form  a vectorized  U symbolic  map. 
IVU(J)  is  an  index  pointing  to  the  start  of  the  J'th  column  of 
numeric  elements  in  the  (yet  to  be  calculated)  packed  U numeric 
array.  Similarly,  IVA  points  to  the  start  of  each  column  in  the 
packed  A numeric  array.  IL,  JL,  and  IVL  describe  the  L matrix. 

We  shall  examine  how  these  arrays  are  determined. 

The  method  for  finding  symbolic  fill  loops  through  each  column. 
Counters  are  kept  and  updated  for  symbolic  and  numeric  positions 
in  L and  U.  These  are  used  for  JU,  JL,  IVU,  and  IVL.  The  per- 
muted A column  is  converted  to  scalar  row  indices,  row  permuted, 
and  converted  back  into  vector  form.  The  code  then  loops  through 
each  U position,  examining  the  corresponding  L column  and  updating 
or  inserting  vectors  in  the  current  column  to  account  for  fill 
locations.  When  there  are  no  more  U positions,  the  vector  repre- 
sentation of  this  column  is  a map  of  LU  for  that  column.  It  is 
split  at  the  diagonal  and  copied  to  IU  and  IL.  The  symbolic  pro- 
cessing of  this  column  is  now  complete. 

2.  Numeric  factorization 

The  A numeric  array  and  all  symbolic  information  is  input  to 
the  numeric  factorization  routine  VMNP.  Since  column  ordering  is 
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VFGh'S  subroutines 


Symbolic  VMSP 

VSORT 

Numeric 

Factorization  (Fortran)  VMNP 

Fact.  (Fortran  - Assembly,  VMNPF 

column- ordered)  VMNPA 

Column- ordered  substitutions  VMBPC 

Row-ordered  substitutions  VN1BPR 

VF.GES/P  subroutines 

Symbolic  ZMSP 

VSORT 

ZBRFAK 

ZMSPO 

Numeric 

Factorization  ZMNP 


ZMNPI 

ZMNPA 

ZMNPB 

ZMNPO 


Column-ordered  substitutions  ZMBPC 

Row-ordered  substitutions  ZMBPR 

I/O  routines  ZI.TB 


Symbolic  and  numeric  preprocessing  UNBLOK 

into  column- ordered  vectorized  VS0RT4 

lists  from  randomly- ordered  (i,j),  ABLOK 

block, or  finite  element  lists 


Table  16.  Subroutine  lists  for  factorization 


used,  standard  LI)  factorization  is  carried  out  using  Lq.  (12). 
Returned  arc  packed  numeric  arrays  I,  and  U corresponding  to  IVL 


and  IVU,  respectively.  Also  returned  is  DI,  an  array  of  the  inverse 
pivot  elements. 

Looping  through  each  column  , the  symbolic  bounds  for  this 
column  in  U and  L are  picked  up  from  JU  and  JL.  These  point  to 
this  column  in  IIJ  and  IL.  The  bounds  of  this  (permuted)  column 
in  A are  retrieved.  Calculations  are  done  on  a full  column  length 
(unpacked)  numeric  array  illustrated  in  Figure  4a.  Row  permuted 
A values  are  copied  into  this  (initially  zero)  scratch  array.  We 
now  loop  through  each  U position  in  this  column.  The  numeric 
multiplier  is  copied  into  U and  bounds  on  the  corresponding  L 
column  are  examined.  Executing  as  a vector  instruction,  the  packed 
L column  times  the  numeric  multiplier  is  put  in  another  scratch 
array  T*.  Looping  through  each  L vector,  values  from  T are  sub- 
tracted from  the  current  X column  as  in  Figure  3 . After  all  U 
elements  in  X have  been  considered,  the  inverse  of  the  diagonal 
is  stored  in  DI.  Remaining  L vectors  in  X are  multiplied  by  the 
inverse  diagonal  and  stored  in  packed  form  in  L.  This  process  is 
executed  for  each  column  of  the  matrix. 

3.  Forward  and  back  substitution 


This  stage  of  the  solution  solves  the  two  systems  L y = b and 
U x = y_.  Given  numeric  and  symbolic  L and  U arrays  and  DI  and  B, 
the  right  hand  side,  VMRPC  employs  Equations  13  and  14  to  solve  the 


*This  describes  the  option  when  separated  multiply  and  subtract 
instructions  are  used. 
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forward  and  back  substitution.  The  solution  vector  x is  returned 
in  B after  the  appropriate  permutations. 

The  right  hand  side  is  first  row  permuted  into  a scratch  vector 
X.  Looping  through  all  but  the  last  column  in  the  forward  sub- 
stitution, for  column  J,  X(J)  times  the  packed  L column  is  stored 
in  the  temporary  array  T in  a vector  instruction.  This  packed  T 
array  is  subtracted  from  the  X vector,  again  as  in  Figure  3.  Com- 
pleting the  forward  substitution,  the  X vector  now  contains 
The  back  substitution  proceeds  similarly  from  the  last  column 
through  the  second  column,  though  X(J)  is  first  multiplied  by  the 
inverse  diagonal.  The  solution  in  X is  last  transferred  to  B 
using  the  column  permutation. 


4.  Conclusions 

Figure  11  is  a flow  diagram  for  this  non-partitioned  program, 
showing  input  and  resultant  arrays.  The  symbolic  processing  is 
executed  first  and  is  independent  of  any  numeric  arrays.  The 
result  of  this  is  an  L U map  which  can  be  used  for  any  matrix  of 
this  size  and  structure.  Next  the  numeric  factorization  and 
forward  and  back  substitution  are  carried  out  on  the  A matrix 
and  right  hand  side.  Note  that  for  different  A numerical  values 
only  VMNP  and  VMBPC  must  be  repeated,  and,  when  B changes,  only 
VMBPC  need  be  repeated.  If  the  matrix  is  originally  given  in 
row  order  , VMBPR  is  used  in  place  of  VMBPC. 

G.  Fortran  Implementation  (VEGES/P) 

1.  Symbolic  processing 

Symbolic  processing  in  VEGES/P  does  not  consider  the  matrix 

(86) 


partitioned, assuming  that  all  symbolic  descriptors  can  all  fit 
into  local  stores.  This  assumption  aids  the  partitioning  process 
considerably,  altnough  possibly  restricting  the  size  of  problems 
which  may  be  solved. 

The  symbolic  part  of  VEGES/P  does  not  produce  separate  pointer 
arrays  into  L and  U,  so  that  vectors  may  cross  the  diagonal.  This 
aids  numeric  computation  in  the  factorization  routines,  as  fill 
vectors  are  not  broken  unnecessarily.  Symbolic  vectors  are  describ- 
ed in  IX,  a combination  of  IU  and  IL  in  VEGES.  JU  and  JL  point 
into  IX,  with  JL  pointing  to  the  first  vector  that  runs  into  L. 

IVU  and  IVL  point  into  separate  L and  U arrays.  JL  and  IVL  will 
not  be  passed  to  the  numeric  routines  because  the  diagonal  can 
be  easily  recognized  by  comparing  the  row  index  with  the  column 
number  while  stepping  through  U vectors,  thus  saving  the  I/O  necessary 
to  transfer  these  arrays.  Besides  these  symbolic  descriptors,  arrays 
LA  and  KA  are  passed  to  the  numeric  routines;  these  correspond  to 
JA  and  IA  except  that  row  and  column  permutations  have  been  applied. 

2.  Partitioning 

Results  from  symbolic  processor  ZMSP  are  passed  to  an 
interactive  routine  ZBREAK,  which,  on  the  basis  of  an  entered 
maximum  variable  array  storage,  calculates  where  the  matrix  should 
be  partitioned.  The  corresponding  buffer  size  for  each  of  these 
strips  is  displayed,  as  well  as  the  number  of  the  first  recalled 
I,  strip  needed  for  factorization  of  each  current  strip.  Maximum 
overall  buffer  sizes  are  printed  for  use  in  accurate  dimensioning 
of  a factorization  and/or  substitution  driver  program.  An  alterna- 
tive to  this  automatic  partitioning  is  also  available;  here  the 
number  of  the  first  column  of  each  strip  is  entered.  The  total 
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number  of  bytes  read  and  written  in  ZMNP  is  also  presented, 
giving  an  indication  of  the  effect  of  breaking  on  I/O  operations. 

Once  a set  of  partitions  is  decided  upon,  ZMSPO  is  called  to 
write  out  the  two  symbolic  data  files,  IAUNIT  and  INUNIT.  IAUNIT 
contains  information  needed  to  reformat  the  A matrix  into  numeric 
buffers  in  ZMNPI.  These  arrays  are  JA,  IA,  LA,  KA,  and  KVA.  Each 
buffer  contains  the  section  of  these  arrays  pertaining  to  that 
strip.  Note  that  strips  are  in  the  permuted  matrix.  JA  and  LA  are 
adjusted  to  point  into  the  buffer  locations  of  IA  and  KA,  respect- 
ively. Vector  KVA  is  created  here  and  points  into  the  X numeric 
buffer  (combined  L and  U strips)  to  be  written  on  JAUNIT,  indicat- 
ing the  start  of  every  permuted  A vector  so  that,  when  the  buffer 
is  initialized,  values  can  be  easily  transferred.  INUNIT  is  read 
by  ZMNP  and  provides  the  symbolic  description  for  each  X block. 
First,  however,  permutation  vectors  IPC  and  IPRI  are  written  out 
for  use  in  ZMNPI  and  ZMBPC.  To  describe  the  X block,  arrays  IVU, 
JU,  and  IX  are  broken  and  written  out.  IVU  is  adjusted  to  point 
into  the  X numeric  buffer  and  JU  to  point  into  IX  in  the  buffer. 
Header  information  is  written  out  with  each  buffer,  including 
buffer  number,  column  range,  and  first  L strip  for  the  numeric 
factorization  and  substitution  routines. 

3 . Strip  numeric  factorization 

The  numeric  factorization  process  is  broken  up  into  several 
steps  (Fig  23).  Subroutine  ZMNPI  is  called  once,  initializing  the 
X numeric  buffers.  Permutation  vectors  are  passed  to  ZMNPI,  which 
reads  symbolic  information  from  IAUNIT.  The  A matrix  is  input  on 
file  NAUNIT  in  unpermuted  order,  written  one  column  per  logical 
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record.  Pile  pointers  to  each  column  are  read  in  from  T BUN  IT. 

(These  file  pointers  were  picked  up  when  the  matrix  was  being 
written,  and  are  the  responsibility  of  the  user.  Also  on  IBUNIT 
is  the  right  hand  side,  to  be  read  by  ZMBPC.  Note  that  with  this 
method  of  storing  the  A matrix,  only  one  column  must  be  in  memory 
at  a time,  but  other  methods  are  possible,  depending  on  the  limit- 
ations of  the  system.)  Strip  by  strip,  the  matrix  is  permuted  and 
fill  positions  are  set  to  zero.  The  resultant  X numeric  buffer  is 
copied  out  to  JAUNIT,  to  be  read  later  in  ZMNP.  All  partitions  are 
processed  at  this  time  so  that  buffer  space  needed  here  can  be  over- 
lapped with  space  used  in  the  factorization. 

4.  Numeric  factorization 

Factorization  takes  place  one  strip  at  a time.  ZMNP  reads  symbolic 
and  numeric  information  from  INUNIT  and  JAUNIT,  respectively,  before 
passing  control  to  ZMNPA.  This  routine  is  given  the  number  of  the 
first  L strip  needed  and  an  array  of  pointers  to  the  start  of  L 
strips  written  on  ILUNIT.  The  last  array  position  set  points  to  the 
current  strip,  or  an  end-of-file,  since  the  current  strip  has  not 
yet  been  factored.  (Note  that  if  a file  system  has  no  pointing 
facility,  ILUNIT  can  be  rewound  and  buffers  read  starting  from  number 
one.  The  price  paid  is  some  unnecessary  I/O,  depending  on  the  matrix 
structure.)  Having  read  an  L buffer,  ZMNPA  performs  the  normal 
numeric  computations  which  involve  only  these  two  ( X and  L)  parts 
of  the  matrix.  L buffers  are  read  and  computations  performed 
until  an  end-of-file  (up  to  the  current  strip)  is  encountered  on 
ILUNIT.  ZMNPB  is  called  to  perform  numeric  computation  within  the 
current  X strip  and  put  the  inverse  diagonal  in  D I ; the  result  is 
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a completely  factored  strip  of  the  L U matrix.  In  ZMNP  numeric 
computations  there  are  only  two  differences  from  VMNP.  Instead 
of  operating  on  an  expanded  (unpacked)  column  before  packing  it 
into  the  L and  U arrays  operations  arc  carried  out  on  the  already 
packed  X numeric  buffer.  The  packed  buffer  option  was  specified 
by  the  preformatting  of  the  X numeric  buffers.  Symbolic  and 
numeric  pointers  have  been  kept  for  each  column  pointing  into 
the  U part  of  the  X block.  As  we  step  through  U,  these  end  up 
pointing  to  the  start  of  L.  These  are  effectively  IVL  and  JL. 

With  IVL,  JL,  IVU,  JU,  and  IX,  we  have  the  information  to 
separate  the  X block  into  L and  U and  write  these  buffers  out 
as  ILUNIT  and  IUUNIT,  respectively.  This  is  done  in  ZMNPO.  IX 
is  broken  at  the  diagonal,  JL  and  JU  adjusted  to  point  into  their 
respective  parts,  and  likewise  the  numeric  buffer  is  broken  and 
IVL  and  IVU  adjusted.  Buffers,  with  appropriate  header  information, 
are  written  out  and  the  position  of  the  ILUNIT  file  is  noted  for 
use  in  ZMNPA.  This  completes  the  operation  on  this  strip,  and 
the  next  symbolic  and  numeric  buffers  are  read  in. 

5 . Forward  and  back  substitution 

ZMBPC  is  input  permutation  vectors  IPC  and  IPRI,  data  files 
ILUNIT,  IUUNIT  and  IBUNIT,  DI  and  vector  B,  scratch  vector  X 
and  space  for  any  buffer  on  the  L and  U data  files.  ZMBPC  performs 
the  forward  substitution  on  L buffers,  reading  1 hem  sequeii  t i a I I y 
from  ILUNIT.  In  the  back  substitution,  however,  the  U buffers  must 
be  retrieved  in  reverse  order.  This  can  be  accomplished  in  various 
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ways,  using  either  IBM  Fortran  direct  I/O,  the  BACKSPACE  statement, 
or  some  system  routine  taking  advantage  of  a particular  file  sys- 
tem. 

The  driver  program  included  with  the  vectorized  sparse  matrix 
package  creates  a sequence  of  randomly-positioned  matrices  with 
randomly- selected  pivot  orders.  It  is  intended  as  a test  program, 
checking  the  solution  vector  against  one  generated  with  the  matrix 
and  flagging  errors;  it  also  gives  an  illustration  of  program  flow 
and  subroutine  calls. 

A similar  driver  and  matrix  generator  is  included  with  the 
partitioning  package,  with  the  addition  that  matrix  break  points 

are  randomly  generated,  and  the  routine  interpreting  these  breaks 
has  been  made  non- interactive  for  convenience  in  running  many 
matrices . 

The  above  program  can  be  used  as  a non- interactive  test  program, 
but  by  setting  a flag,  the  break  point  generation  will  be  bypassed 
and  full  interaction  with  the  breaking  routine  is  possible. 
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Figure  22.  Symbolic  factorization  flow  chart  for  partitioned 
matrix  solver 
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Figure  23.  Numeric  factorization  and  substitution  flow  chart 
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Appendix  A 

Numerical  Experiments 


I.  Introduction 

Two  dominant  issues  related  to  the  efficiency  of  a vectorized 
sparse  equation  solver  are 

1)  matrix  size  and  density, 

2)  the  compatibility  of  matrix  structure  and 
computer  architecture. 

Therefore,  an  experimental  study  should  involve  a class  of  matrices 
of  increasing  size,  with  documented  densities  and  structural 
regularity . 

Although  these  properties  can  be  synthesized  relatively  inde- 
pendently of  one  another  by  generating  variants  of  randomly- 
positioned  matrices,  such  studies  are  often  viewed  as  unrepresent  - 
arive  of  commonly- encountered  sparse  matrices.  For  this  reason, 
the  matrices  in  this  study  are  either  taken  directly  from  an 
application  or  synthesized  according  to  grid  (mesh)  generation 
rules  associated  with  finite  element  problems,  solved  by  dissection 
methods  [1] . 

II.  Finite-element  Matrices 

Illustrated  in  Figure  Alfa),  the  nodes  of  a rectangular  2- 
dimensional  linear  finite  element  grid  are  numbered  in  a prescribed 
manner  [1]  so  as  to  yield  local  decoupling  of  rows  and  columns  of 
the  associated  matrix,  the  local  coupling  replaced  by  a distributed 
coupling  throughout  the  matrix.  This  dissection  process  proceeds 
routinely  so  long  as  the  number  of  nodes/side  is  2n-l,  and  is 
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Figure  A1 . Simple  dissected  grid  and  matrix 
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Larger  dissected  grid 

illustrated  for  a larger  grid  in  figure  A2. 


Due  to  the  regular  structure  of  this  finite  element  family,  it 
is  possible  to  obtain  exact  arithmetic  operation  counts  and  estimates 
of  the  vector  counts  involved  in  the  matrix  factorization.  The 
asymptotic  components  of  these  formulae  (for  large  grids)  are  given 
in  Table  A1 , together  with  exact  counts  obtained  from  experimental 
solutions  of  the  matrices.  This  is  intended  to  establish  the 

validity  of  the  formulae  used  throughout  the  report  to  estimate  If 

computational  complexities  of  problems  beyond  the  range  of  the 
experiments . 

Tables  A2  and  A3  give  a variety  of  structural  properties  and 
execution  times  for  n = 2, 3, 4, 5.  Most  of  the  results  are  cited 
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in  context  throughout  the  report.  Also  included  are  the  timings 
and  memory  requirements  associated  with  two  early  procedures 
[6]  [9]  for  solving  relatively  small  sparse  systems,  namely  the  linear 
code  and  interpretive  list  results  of  the  first  and  second  data 
columns.  These  procedures  are  discussed  briefly  on  page  5 . The 
observer  will  note  the  extreme  speed  and  large  memory  requirements  of 
the  former,  and  the  lack  of  advantage  in  either  speed  or  memory  of  the 
latter.  Other  comparisons  are  given  in  [2]  [11]. 

For  the  partitioned  solution,  the  fractions  of  the  total  solution 
time  devoted  to  computation,  I/O,  and  other  partition-related  over- 
head are  critical  to  the  evaluation  of  the  algorithm  and  its 
implememtat ion.  Table  A4  displays  these  timings  for  a number  of 
sizes  of  available  local  store  (S^)  . 

' I 


III.  Other  Sparse  Matrices 


Three  other  matrices  were  chosen  for  study  (Table  A5),  ranging 
from  a highly  sparse  but  large  power  system  problem  to  a finite 
element  problem  of  large  size  and  greater  density  than  the  family 
cited  in  Table  A2  . Credits  and  references  are  as  follows: 

1)  Electrical  power  system,  from  Mr.  Walter 
Snyder,  American  Electric  Power; 

2)  Three-dimensional  44-body  mechanism  model  of 
Boeing  747  landing  system  [21],  from  Mr.  Keith 
Brewer  of  the  Flight  Dynamics  Laboratory  of  Wright 
Patterson  Air  Force  Base; 

3)  Linear,  2 variable,  2 dimensional  finite  element 
model  of  MESFET  transistor  [22],  from  Dr.  John 
Barnes,  American  Microsystems. 
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Property  Asymptotic  Formula  Experimental  Error 

Formula  Evaluation  for  value  (per  cent) 
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Mat r i x 
1) i mens  ion 

A 

Mat  rix 
Storage 

L 

U 

Inner  Loop 
Computat ion* 

factorization  1;.  ti 

B.  Sub 

9 

2.88 

1.91 

2.62 

1.97  (2.31) 

2.25 

(2.62) 

49 

2.71 

2.74 

2.71 

2.92  (4.15) 

2.53 

(3.78) 

225 

2.87 

3.71 

2.97 

4.42  (6.69) 

3.34 

(5.29) 

961 

2.93 

4.83 

3.86 

6.85(10.6) 

3.86 

(7.07) 

^Presented  as:  successive  multiply-subtract (separate  multiply- 

subtract) 


(a)  Average  Vector  Lengths  (words) 

Table  A2.  Structural  properties  of 
finite  element  experimental  problems 


Matrix 

Dimension 

LU  Storage* 

(bytes) 

Column 
Density 
A LU 

% non- 

zero* 

Numerical  (*8) 

Symbolic  (*2) 

A 

LU 

9 

336 . 

62. 

5.44  2.33 

60. 5 

58.3 

49 

4,880. 

752  . 

7.36  6.22 

15.0 

25.9 

225 

40,848. 

5,376. 

3.21  11.3 

3.65 

10.1 

961 

267,280. 

31,712. 

8.61  17.4 

. 897 

3.62 

*Unit  diagonal  not  included 


(b)  Density 

Table  A2.  Structural  properties  of 
finite  element  experimental  problems 
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Table  A3.  Experimental  results  comparing  five  sparse  matrix  solution  methods  applied  to 
dissected  finite  element  grid;  times  in  milliseconds,  memory  in  bytes;  .Amdahl 
470  V/6,  Fortran  H,  double  precision 
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Table  A5.  Results  of  vectorized  solutions  of 
engineer ing- related  sparse  matrices 
(Amdahl  470  V/6,  Fortran  H) 
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Separate  multiply-subtract  Combined  multiply-subtract 
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Appendix  C.  Main  Programs  for  Solution 

of  Finite  Element  Grid  of  Figure  19 


IMPLICIT  REAL*8(A-H.0-Z) 


SYMBOLIC  MAIN  PROGRAM  FOR  EXAMPLE  OF  FINITE  ELEMENT 
PREPROCESSING.  SEE  VARIABLE  DESCRIPTIONS  INTERNAL  TO 
SUBROUTINES. 


DIMENSIONS  FOR  ROUTINE  UNBLOK 


INTEGER*2  IF'ART , LEN , I A , NA 
DIMENSION  IF'ART  ( 6 ) , I XP  ( 6 ) 

DIMENSION  NA< 104 ) 

DIMENSION  MAP  ( 1 000 > . LEN  < 1 000  ) , IF'B < 1000  > , JPB ( 1 000  > 
DIMENSION  JP1<47>  , IVA<47>  r JA<47)  »IA<260) 

INTEGER  N/46/ 

INTEGER  I VEC/O/ » NUMBUF/2/ » NF'ART /5/ , NBC/8/ 

DATA  IFART/1 , 11 ,21 ,31 ,41 .47/ 

DATA  IXP/1 .25.49 . 73.97.97/ 

DATA  NA/1.2.3.3.4.5.1 .2.6.2.3.8.3.4.8.4.5.10. 

C 2.6.8.4.8.10. 

6.7.8.8.9.10.6.7.11.7.8.13.8.9.13.9.10.15. 
7,11,13.9.13.15, 

11,12,13,13,14,15.11,12,16,12,13,18,13,14,18, 
14,15,20,12,16,18,14,18,20, 

16,17,18,18,19,20,16,17,21,17,18,22,18,19,22, 
19,20,23,17,21,22,19,22,23, 

1,2,5,9,41,43,45.46/ 

INTEGER  IMUNIT/O/ 

COMMON  TO  ROUTINE  UNBLOK 

COMMON  /FEL/  NOAR.NNODE 
COMMON  TO  ZLIB  I/O  ROUTINES 
COMMON  /ZLEN/  MAXLEN 


C 

c 

c 

c 

c 

c 

c 


DIMENSIONS  FOR  ZMSP,  ZBREAK.  ZMSPO 


1NTEGER*2  IX . IXT , IXB , IP . ICNT , IPC , IPRI ,KA» IXBUFF 
DIMENSION  I BREAK  < 51 ) 

DIMENSION  IF’C  ( 46  ) , IF’RI  < 46  ) 

DIMENSION  ISYMF'L  ( 46  > , IS YMF'U ( 46  > , IKVA  < 46 ) , IUF'0S<46) 

DIMENSION  JU(47) »JL(47> , I VU ( 47 ) , I VL (47) ,LA<47> 

DIMENSION  IF’TR  ( 47 ) ,IXT (47) ,IXB<47) , IP(47) 

DIMENSION  KA(260),IX(300) 

DIMENSION  JXBUFF  < 500 ) , IXBUFF ( 1000) 

EQUIVALENCE  (I XBUFF , JXBUFF ) 

DATA  IF'C/ 1,2, 3. 4, 5, 6, 7. 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 

C 21,22,23,24,25,26,27,28,29.30,31,32,33,34.35,36,37,38.39, 

C 40,41,42,43,44,45,46/ 

DATA  IPRI/1, 2, 3, 4, 5, 6, 7. 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20. 
C 21.22,23,24,25,26,27,28,29.30,31,32,33,34,35,36,37,38,39,40, 
C 41,42,43,44,45,46/ 

INTEGER  IAUNIT/1/, INUNIT/2/ 

INTEGER  MAXCNT / 1 / , MAXK A/260/ » MAXXS/300/ 

COMMON  TO  ROUTINE  ZBREAK 

COMMON  /SIZE/  MAXTIB , MAX TIX , MAXTA, MAXTZB 
COMMON  TO  ROUTINE  ZBREAK 
COMMON  /INTACT/  IF'R 


NVAR=  2 
NNODE=  3 
MAXLEN=  32758 


NUMERIC  BUFFER  SIZES  - NOT  APPLICABLE  IF 
SYMBOLIC  AND  NUMERIC  RUN  SEPARATELY 


MAXTIB*  10000 
MAXTIX*  10000 
MAXTA*  10000 
MAXTZB*  10000 


IPR*  1 
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SYMBOLIC  FINITE  EICMENT  AND  BLOCK  F'REF  K0CES5 1 NG 

CALL  UNBLOK(N,NA.  IXP.NBC.NPART.  IPART  * I VEC  , NUMBUF.  IMUNI1  r 
C IPB.  JPB.  JP1  ,K1  .M»\F  .LLN.  I AD  IN.  IVA,  JA>  IA> 

WRITE  A NUMERIC  BUFFER  DIMENSION 


ID=  IADIM*NUMBUF 
PRINT  1 > I D 

1 FORMAT ( 'ODIMENSION  A(',I5.'>'> 


SYMBOLIC  MATRIX  FACTORIZATION 

CALL  ZMSP(N, JA. I A, IVA. JU. JL » IX , I VU . I VL. IXT » I X8 » IP, 

C ICNT.HAXXS.MAXCNI , IPC , IPRI .LA . KA . MAXKA > 

INTERACTIVE  MATRIX  PARTITIONING 

CALL  Z BREAK ( N , I VA » JA , LA » JU. JL , IX » I VU, I VL » NBRKS • I BREAK  » 

C ISYMFL  , ISYMPU, IKVA. IUFOS, IPTRf  *2) 

OUTPUT  SYMBOLIC  MATRIX  PARTITIONS 

CALL  ZM5F0(N,IVA.IA.JA»LA.KA.JU.IX.IVU,1VL.IPC. IPRI. I BREAK 
C NBRKS. IXBUFF. JXBUFF. ISYMPL. ISYHFU.  IFTR. IAUNI T . INUNIT > 

2 STOP 
END 

IMPLICIT  REAL*S( A-H.O-Z) 

NUMERIC  MAIN  PROGRAM  FOR  EXAMPLE  OF  FINITE  ELEMENT 
PREPROCESSING.  SEE  VARIABLE  DESCRIPTIONS  INTERNAL  TO 
SUBROUTINES. 

DIMENSIONS  FOR  ROUTINE  ABLOK 

INTEGER*?  IPART . LEN.NB 
DIMENSION  IPART < 6 ) .MAPI 300 ) » NB( t6) 

DIMENSION  I VA ( 47  > 

DIMENSION  A < 248 ) . B < 46  > 

INTEGER  N/46/ 

INTEGER  IMUNIT/O/ .NAUNIT/3/ r IBUNIT/A/ .NBUNIT/5/ 

DIMENSIONS  FOR  ROUTINES  ZMNP.  ZMBPR 


INTEGER*?  IPC. IPRI . IXBUFF . IBBUFF 
DIMENSION  IPC<4&> »IPRI<46> rIPTR(46> 

DIMENSION  DIF 4A ) r X ( 46 > 

DIMENSION  XBUFF ( 308 ) .JXBUFF (614) , IXBUFF (1232) 
DIMENSION  XBUFF I ( 223  > . JBBUFF ( 4*6 ) . IBBUFF<892> 
EQUIVALENCE  ( I XBUFF . JXBUFF .XBUFF , A ) 


EQUIVALENCE  ( IBBUFF . JBBUFF, XBUFFI , IXBUFF<73>  > 
INTEGER  IAUNIT/l/, INUNIT/2/ 

INTEGER  JAUNIT/8/. ILUNI T/9/ , IUUNIT/10/ 

COMMON  TO  ZLIB  I/O  ROUTINES 


COMMON  /Zl.EN/  MAXLEN 


MAXLEN=  32758 
NUMERIC  FORMULATION 

CALL  NF  INI  < N , A , NBUNIT  ) 

NUMERIC  FINITE  ELEMENT  AND  BLOCK  PREPROCESSING 

CALL  ABLOK (N.NPART. IPART , NB . MAP , LEN , IVA.A.B, 

C IMUNIT. NBUNIT, NAUNIT.IBUNIT) 

NUMERIC  MATRIX  FACTORIZATION 

CALL  ZMNP ( N , A , I BBUFF , JBBUFF , XBUFFI , I XBUFF , JXBUFF , XBUFF , IPC 
C IPRI.DI , IF'TR,  IAUNI  T.NAUNIT,  I BUNIT  , JAUNI T . INUNIT.  ILUNIT. 

C IUUNIT) 

ROW  ORDER  FORWARD  AND  BACK  SUBSTITUTION 

CALL  ZMBPR (N. IXBUFF. JXBUFF, XBUFF, IBUNIT, ILUNIT. IUUNIT. 

C DI.B.X, IPC. IPRI) 

WRITE  SOLUTION  VECTOR 

PRINT  1»<B(I>.I=1.N) 

1 FORMAT ( ' OSOLN  VEC '/(5G14.5)) 

STOP 

END 
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